SCHOLAR Study Guide SQA Advanced Higher Computing Unit 3c: Computer Architecture

David Bethune Heriot-Watt University Andy Cochrane Heriot-Watt University Ian King Heriot-Watt University

Interactive University, Edinburgh EH12 9QQ, United Kingdom. First published 2005 by Heriot-Watt University. This edition published in 2006 by Interactive University.

Copyright © 2006 Heriot-Watt University. Members of the SCHOLAR Forum may reproduce this publication in whole or in part for educational purposes within their establishment, providing that no profit accrues at any stage. Any other use of the materials is governed by the general copyright statement that follows. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, without written permission from the publisher. Heriot-Watt University accepts no responsibility or liability whatsoever with regard to the information contained in this study guide.

SCHOLAR is a programme of Heriot-Watt University and is published and distributed on its behalf by Interactive University. British Library Cataloguing in Publication Data Interactive University SCHOLAR Study Guide Unit 3: Computing 1. Computing I. Computer Architecture ISBN 1 904647 89 8 Typeset by: Interactive University, Wallace House, 1 Lochside Avenue, Edinburgh, EH12 9QQ. Printed and bound in Great Britain by Graphic and Printing Services, Heriot-Watt University, Edinburgh.

Part Number 2006-1385

Acknowledgements

Thanks are due to the members of Heriot-Watt University's SCHOLAR team who planned and created these materials, and to the many colleagues who reviewed the content.

Programme Director: Professor R R Leitch
Series Editor: Professor J Cowan
Subject Directors: Professor P John (Chemistry), Professor C E Beevers (Mathematics), Dr P J B King (Computing), Dr P G Meaden (Biology), Dr M R Steel (Physics), Dr C G Tinker (French)
Subject Authors:
Biology: Dr J M Burt, Ms E Humphrey, Ms L Knight, Mr J B McCann, Mr D Millar, Ms N Randle, Ms S Ross, Ms Y Stahl, Ms S Steen, Ms N Tweedie
Chemistry: Mr M Anderson, Mr B Bease, Dr J H Cameron, Dr P Johnson, Mr B T McKerchar, Dr A A Sandison
Computing: Mr I E Aitchison, Dr P O B Holt, Mr S McMorris, Mr B Palmer, Ms J Swanson, Mr A Weddle
Engineering: Mr J Hill, Ms H L Jackson, Mr H Laidlaw, Professor W H Müller
French: Mr M Fermin, Ms B Guenier, Ms C Hastie, Ms S C E Thoday
Mathematics: Mr J Dowman, Ms A Johnstone, Ms O Khaled, Mr C McGuire, Ms J S Paterson, Mr S Rogers, Ms D A Watson
Physics: Mr J McCabe, Mr C Milne, Dr A Tookey, Mr C White
Learning Technology: Dr W Austin, Ms N Beasley, Ms J Benzie, Dr D Cole, Mr A Crofts, Ms S Davies, Mr A Dunn, Mr M Holligan, Dr J Liddle, Ms S McConnell, Mr N Miller, Mr N Morris, Ms E Mowat, Mr S Nicol, Dr W Nightingale, Mr R Pointon, Mr D Reid, Dr R Thomas, Dr N Tomes, Ms J Wang, Mr P Whitton
Cue Assessment Group: Ms F Costigan, Mr D J Fiddes, Dr D H Jackson, Mr S G Marshall
SCHOLAR Unit: Mr G Toner, M G Cruse, Ms A Hay, Ms C Keogh, Ms B Laidlaw, Mr J Walsh
Media: Mr D Hurst, Mr P Booth, Mr G Cowper, Mr C Gruber, Mr D S Marsland, Mr C Nicol, Mr C Wilson
Administration: Ms L El-Ghorr, Dr M King, Dr R Rist

We would like to acknowledge the assistance of the education authorities, colleges, teachers and students who helped to plan the SCHOLAR programme and who evaluated these materials. Grateful acknowledgement is made for permission to use the following material in the SCHOLAR programme: To the Scottish Qualifications Authority for permission to use Past Papers assessments. The financial support from the Scottish Executive is gratefully acknowledged.

All brand names, product names, logos and related devices are used for identification purposes only and are trademarks, registered trademarks or service marks of their respective holders.


Contents

1 Computer Structure ...... 1
1.1 Prior Knowledge and Revision ...... 2
1.2 Introduction ...... 3
1.3 Computer Organisation ...... 3
1.4 Memory ...... 8
1.5 Why binary? ...... 20
1.6 The binary number system ...... 21
1.7 Conversion between number systems ...... 22
1.8 Hexadecimal Numbers ...... 23
1.9 Summary ...... 25
1.10 Revision Questions ...... 25

2 Processor Structure ...... 27
2.1 Prior Knowledge and Revision ...... 28
2.2 Introduction ...... 29
2.3 The Processor ...... 29
2.4 Improving performance ...... 33
2.5 Summary ...... 41
2.6 Revision Questions ...... 42

3 Processor Function ...... 45
3.1 Prior Knowledge and Revision ...... 47
3.2 Introduction ...... 48
3.3 Machine Code and Assembly Language ...... 48
3.4 Assembly language instructions ...... 49
3.5 The 6502 - a Case Study ...... 52
3.6 Running Assembly Language Programs ...... 79
3.7 Summary ...... 92
3.8 Revision Questions ...... 92

4 Processor Performance ...... 95
4.1 Prior Knowledge and Revision ...... 97
4.2 Introduction ...... 98
4.3 RISC and CISC ...... 99
4.4 SIMD ...... 102
4.5 Pipelining ...... 103
4.6 Superscalar Processing ...... 110
4.7 Summary ...... 111
4.8 Revision Questions ...... 112

5 Processor Development ...... 115
5.1 Prior Knowledge and Revision ...... 117
5.2 Introduction ...... 117
5.3 The Intel X86 Series ...... 119
5.4 The PowerPC series ...... 128
5.5 The Intel IA-64 ...... 134
5.6 Parallel Computing ...... 138
5.7 Summary ...... 143
5.8 Revision Questions ...... 144

6 Operating System Functions ...... 147
6.1 Prior Knowledge and Revision ...... 149
6.2 Introduction to Operating Systems ...... 150
6.3 Memory management ...... 153
6.4 Process management ...... 162
6.5 Input and output (I/O) ...... 167
6.6 File Management ...... 170
6.7 Summary ...... 184
6.8 Revision Questions ...... 185

7 Operating System Developments ...... 187
7.1 Prior Knowledge and Revision ...... 189
7.2 The User interface ...... 189
7.3 Operating System Application Services ...... 196
7.4 User Access ...... 198
7.5 Comparing 2 Operating Systems ...... 200
7.6 Summary ...... 212
7.7 Revision Questions ...... 213

Glossary ...... 214

Answers to questions and activities ...... 218
1 Computer Structure ...... 218
2 Processor Structure ...... 220
3 Processor Function ...... 222
4 Processor Performance ...... 224
5 Processor Development ...... 225
6 Operating System Functions ...... 227
7 Operating System Developments ...... 230

© HERIOT-WATT UNIVERSITY 2006

Topic 1

Computer Structure

Contents

1.1 Prior Knowledge and Revision ...... 2
1.2 Introduction ...... 3
1.3 Computer Organisation ...... 3
1.3.1 The components of a computer system ...... 3
1.3.2 Basic Architecture ...... 4
1.3.3 A typical simple computer system ...... 6
1.4 Memory ...... 8
1.4.1 Internal memory ...... 8
1.4.2 External Memory ...... 12
1.4.3 A memory hierarchy ...... 18
1.4.4 Memory and performance ...... 19
1.5 Why binary? ...... 20
1.5.1 Review questions 1 ...... 21
1.6 The binary number system ...... 21
1.6.1 Review questions 2 ...... 21
1.7 Conversion between number systems ...... 22
1.7.1 Review questions 3 ...... 23
1.8 Hexadecimal Numbers ...... 23
1.8.1 Review questions ...... 24
1.9 Summary ...... 25
1.10 Revision Questions ...... 25

Learning Objectives

After studying this topic, you should be able to:

• represent any computer system as a 5-box diagram

• show how processor, main memory, peripherals and backing storage devices are connected by the data bus and address bus

• explain how whole numbers are represented in binary

• explain how binary numbers can be represented in hexadecimal

• convert between decimal, binary and hexadecimal representations of whole numbers

1.1 Prior Knowledge and Revision

Before studying this topic, you should have studied Computer Systems to Higher level. In particular, you should be able to:

• describe how numbers are stored in binary code in a computer

• describe the main types of backing storage used in computer systems, including

  – magnetic discs (floppy discs, hard discs)
  – optical discs (CD, DVD)
  – magnetic tape

• describe the function of each of the following parts of a computer system

  – processor, ALU, control unit and registers
  – data bus, address bus and control lines

However, we will review all of these important concepts in this topic, before exploring them in more depth.

Revision

Q1: Binary numbers are used to store

a) numbers, but not other forms of data, within a computer system
b) all types of data and instructions within a computer system
c) any data within the processor, but not data being transferred to a peripheral device
d) integers, but not real numbers, within a computer system

Q2: Which of the following backing storage devices fits this description: 700Mb of storage, can be re-used many times, data stored in magnetic form?

a) hard disk
b) CD-ROM
c) CD-RW
d) rewritable DVD

Q3: The ALU, control unit and registers are all found

a) within a computer's memory
b) in desktop computers, but not in laptops or hand-held computers
c) within the processor of all types of computer
d) in computer peripherals rather than in the computer itself

Q4: Data being transferred between processor and memory is transmitted along

a) the address bus
b) control lines
c) the data bus
d) registers


Q5: Which of these lists is in increasing order of computing "power"?

a) desktop, palmtop, mainframe
b) , desktop, palmtop
c) laptop, supercomputer, desktop
d) palmtop, laptop, supercomputer

1.2 Introduction

In this first sub-topic, you will review the basic structure of any computer system: how the processor, main memory, backing storage and input and output devices are all connected by the system bus (data bus and address bus).

Next, you will review your knowledge of the binary number system, which is used by all computers to store and represent all forms of data: numbers, text, graphics, sound and machine code instructions. The binary number system is used because it is easily implemented using digital electronic devices.

Although binary numbers are well suited to computers, they are very difficult for humans to read. For example, the phone number of Interactive University's Edinburgh Office in binary would be: 0000 0001 0011 0001 0011 0001 0111 0100 0000 0000 0000. Imagine trying to memorise that! Because of this, computer scientists use a system called hexadecimal to represent binary numbers. Hexadecimal is much more compact than binary, and is easily converted into binary when required. In this topic, you will learn enough about hexadecimal (or hex, as it is usually called) to allow you to understand how processors work.

The real new learning begins in topic 2, but it is important to ensure you have a basic understanding of the concepts in topic 1 before you go ahead.
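The relationship between decimal, binary and hex can be checked with a few lines of Python, used here purely as a calculator (this study guide does not itself use Python):

```python
# Converting a whole number between decimal, binary and hexadecimal.
n = 2006

print(bin(n))   # 0b11111010110
print(hex(n))   # 0x7d6

# Converting back: int() accepts the base as a second argument.
assert int("11111010110", 2) == 2006
assert int("7d6", 16) == 2006
```

Each hexadecimal digit stands for exactly four binary digits, which is why hex is such a compact shorthand for long binary strings.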

1.3 Computer Organisation

1.3.1 The components of a computer system

Computers come in a wide range of shapes and sizes, from hand-held computers and embedded systems, through laptops and desktop PCs, to servers and mainframes.


[Images: a PDA, a notebook, a desktop PC and a server]

Although these may seem to be very different, they can all be represented by the same 5 box diagram:

[Figure: the 5-box diagram. Input devices and output devices connect to a central box containing the processor (+cache) and main memory, with backing storage attached to the same central box.]

In this Unit, we will be exploring in more detail how the central "box", which includes the processor and main memory, functions.

1.3.2 Basic Architecture

Figure 1.1 shows the main components of a simple computer and their interconnections.


[Figure 1.1 shows: the processor, containing the Control Unit, Arithmetic Logic Unit and registers (including the PC and IR) linked by the internal processor bus, connected to main memory, external memory and the peripheral devices by the external system bus, which comprises the data bus, the address bus and control lines.]

Figure 1.1: The organisation of a simple computer

Data and control signals are communicated between the main components of the computer via the system bus or external bus. A bus is a collection of parallel wires, each of which can carry a digital signal. Thus a sixteen-bit-wide bus can transmit 16 bits simultaneously. The processor has its own internal bus to allow the communication of data and control signals between its component parts.

It is worth noting here that the system bus contains lines to transmit data, lines to transmit memory addresses and various control lines. Frequently it is thought of as if it were separate buses: a data bus, an address bus and control lines. Sometimes the data and address lines are not separate at all, but the same lines are used for different purposes at different times: one moment sending an address to memory, the next transmitting data from the addressed memory location to the processor.

The computer illustrated in Figure 1.1 is a typical example of a Von Neumann architecture; virtually all computers follow this architecture. It has its origins in the stored program concept, which was proposed by John Von Neumann in 1945. The Stored Program Concept is the basic idea that the sequence of instructions (or program) to solve a problem should be stored in the same memory as the data. This idea ensured that the computer became a general-purpose problem-solving tool, since making it solve a different problem requires only that a different program be placed in memory.

The component parts of the computer are:

• Processor. Carries out computation and has overall control of the computer.

• Main memory. Stores programs and data while the computer is running. It is fast to access, directly accessible by the processor, limited in size and non-permanent.

• External memory. Holds quantities of information too large for storage in main memory. It is slower to access than main memory and not directly accessible by the processor, but can be used to keep a permanent copy of programs and data.

• Peripheral devices (input/output devices). These allow the computer to communicate with the outside world.

• External system bus. This allows communication of information between the component parts of the computer.

Some possible transfers of information via the system bus are:

• data is transmitted from main memory to the processor;

• input data from an external device (e.g. the keyboard) travels from the device to main memory;

• information from external memory is transmitted to main memory.

The speed of the system bus is very important: if it is too slow, the speed of the processor is restricted by having to wait for data.

The processor typically consists of a Control Unit (CU), which exerts overall control over the operation of the processor; an Arithmetic and Logic Unit (ALU), which carries out computation; and a set of registers, which can hold intermediate results during a computation. Two of these registers are of particular importance. The Program Counter (PC) holds the address in memory of the next instruction in the program. The Instruction Register (IR) holds the instruction currently being executed. These components are linked by an internal bus.

In practice the architecture of a modern digital computer will be more complex than what has been described, with each component being itself an assembly of parts connected by various different buses. However, for the moment this will suffice as a model for how the major parts of a digital computer are organised.
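The way the PC and IR cooperate can be sketched as a toy fetch-execute loop. The three-instruction machine below is invented purely for illustration; it is not the instruction set of any real processor:

```python
# A toy fetch-execute loop. The instruction set is hypothetical.
memory = [
    ("LOAD", 7),     # put 7 in the accumulator
    ("ADD", 5),      # add 5 to it
    ("HALT", None),  # stop
]

pc = 0    # Program Counter: address of the NEXT instruction
acc = 0   # a general purpose register (an accumulator)

while True:
    ir = memory[pc]     # fetch: copy the instruction into the IR
    pc += 1             # the PC immediately moves on
    op, operand = ir    # decode
    if op == "LOAD":    # execute
        acc = operand
    elif op == "ADD":
        acc += operand
    elif op == "HALT":
        break

print(acc)   # 12
```

Notice that the PC is incremented as soon as the instruction has been fetched, so during execution it already points at the next instruction; this detail matters when we meet jump instructions later in the unit.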

1.3.3 A typical simple computer system

Figure 1.2 is a schematic diagram of a typical microcomputer system built around the 6502 microprocessor.


[Figure 1.2 shows: the processor, driven by a crystal-controlled clock (XTAL), connected to ROM, RAM and a PIO unit by an 8-bit data bus, a 16-bit address bus and control lines; the PIO connects to the I/O devices via I/O buses.]

Figure 1.2: A simple 6502 based system

The 6502 microprocessor (introduced in about 1975) was among the first microprocessors to be used in early home computers. It was used in early Commodore, Apple and Atari home computers and was also the processor used in the BBC Micro. The 6502 processor included the usual Arithmetic/Logic Unit, some internal registers and a Control Unit, all on the same chip. It had an external crystal-controlled clock to generate timing signals. Read-Only Memory (ROM) was used to hold a bootstrap program to permit initial operation of the system. Random Access Memory (RAM) was used to hold programs and data. The interface with the external devices was via a Programmable Input/Output unit (PIO), which communicated with the external devices using 16-bit-wide I/O buses.

The external bus was a combination of an 8-bit-wide data bus, a 16-bit-wide address bus and some control lines that carried synchronisation signals throughout the system. Hence only 8-bit data could be moved around the system, but 16-bit addresses could be used to address memory. Memory locations with addresses in the range 0 to 65535 (2^16 - 1) could be directly addressed. Memory was made up of individually addressable 8-bit words (bytes).

The 6502 was a typical design of its period. Early models ran at a clock rate of 1 MHz, but some later processors ran at 6 MHz. However, with its limitation to an 8-bit-wide data bus and 8-bit arithmetic with no multiply or divide operations, it could not compete with later 16- and 32-bit microprocessors.

The beginning of the era that led to the very fast personal computers of today was signalled by the arrival of the Intel 8086 series of processors. These processors were introduced in 1978 and were used in the early IBM PC. Systems that used these processors had a layout similar to our general model. However, the processors were full 16-bit processors with 16-bit arithmetic, including multiply and divide operations and logical operations. A 16-bit-wide bus was used both as an address bus and as a data bus. When it operated as an address bus, an additional 4 bits of address were available by using four additional pins on the processor chip. Hence 16-bit data and 20-bit addresses could be transmitted. The 8086 had the capability to address 1 megabyte of memory, since 2^20 = 1,048,576. The IBM PC, using the 8086/8088 processors, ran at a clock rate of 4.7 MHz and could execute some instructions in as little as 400 nanoseconds, compared with the two microseconds of earlier 8-bit personal computers.

This basic architecture has been extended through the introduction of successively more complex processors to the range. These include the 80286, 80386, 80486 and Pentium chips. We will study these in detail in a later topic.
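The link between address bus width and addressable memory is a simple power of two, and both figures quoted above can be checked directly (a quick Python sketch):

```python
# An n-bit address bus can select 2**n distinct memory locations,
# numbered 0 to 2**n - 1.
def addressable_locations(address_bits):
    return 2 ** address_bits

# 6502: 16 address lines, so byte addresses 0 to 65535.
print(addressable_locations(16) - 1)   # 65535

# 8086: 16 bus lines plus 4 extra pins = 20 address bits.
print(addressable_locations(20))       # 1048576, i.e. 1 megabyte
```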

1.4 Memory

Before we study the processor itself in more detail, we will review and organise what you should already know about computer memory. As we have seen, memory is a key concept in the evolution and development of the modern computer. Historically, adding and calculating machines had no method of storing data or instructions, so they were very limited in what they could achieve. The invention of ways of storing data and instructions allowed computers to be developed that were multi-purpose programmable machines.

Where and how can computers store data? Computers can store data in internal memory (within the computer itself) or external memory (backing storage).

1.4.1 Internal memory

Internal memory is any memory device located within the computer system itself. This includes:

• Registers (storage within the processor)

• Cache memory (may be in the processor, or just outside it but on the motherboard)

• Main memory (mostly RAM, in separate memory chips on the motherboard)

1.4.1.1 Registers

Registers are data storage locations within the processor itself. Each register can hold an item of data (which could be several bytes) that can be accessed almost instantaneously by the ALU. These registers hold the data that the processor is currently using, and the current machine code instructions. A typical processor may have hundreds of registers. A few of these have very specific purposes, and have names, including:

• the memory address register (MAR)

• the memory data register (MDR) - sometimes called the memory buffer register (MBR)

• the program counter (PC)

• the instruction register (IR)

We will study the exact purpose of each of these named registers later in this unit. The other registers within the processor are called general purpose registers, and can be used by the processor for storing and manipulating any data as required.

[Figure 1.3 shows: inside the processor, the Control Unit, PC, IR, MAR, MDR, general purpose registers and ALU linked by the internal processor bus, with control paths from the Control Unit and data paths between the registers and the ALU. The MAR drives the address bus to memory; the MDR is connected to the data bus.]

Figure 1.3: Typical microprocessor design

The key facts about registers are:

• they are part of the processor chip

• access to the data in them is almost instantaneous, measured in nanoseconds (10^-9 s)

• they are limited in number, anything from a few in early microprocessors up to a few thousand in more recent processors

• they are used to hold data temporarily which is needed immediately by the processor


1.4.1.2 Main Memory

Main memory consists of specialised data storage chips which are either soldered on to the motherboard, or can be plugged into special slots on the motherboard. There are various types and configurations, including SIMM (single in-line memory modules) and DIMM (dual in-line memory modules), DRAM (dynamic random access memory), SRAM (static random access memory), and ROM (read only memory). Different computer systems will have different configurations and total amounts of main memory. Currently, typical microcomputers have 512 Mb of main memory. This is used to store programs and data currently being used by the processor. The main memory consists of individual memory locations, each able to store several bytes (typically 4 or 8 in current systems), and each with its own unique memory address, so that the processor can access any item of data directly.

Figure 1.4: Main memory

The diagram below shows a typical motherboard layout. You can see the position of the main processor chip, soldered-on RAM chips, slots for additional main memory, and cache memory chips.


[Figure: an annotated motherboard layout, showing the main processor, SIMM sockets, main RAM, cache memory, the PCI/cache/memory controller, video and I/O controllers, the power supply connector, and the various drive and expansion (PCI, ISA, VESA) connectors.]

Figure 1.5: A typical motherboard layout

It is possible to add more main memory to most computer systems. Prices are changing all the time. At the time of writing (2004), typical prices for extra memory chips are around £40 for 256 Mb, £70 for 512 Mb or £150 for 1 Gb. That works out at around 15p per Megabyte, which is quite expensive compared to many other forms of data storage. However, on-motherboard main memory is fairly quick to access compared to other forms of storage. Typical access time to get data from main memory to the processor would be in the range 10-100 nanoseconds, with data transfer rates of up to 3Gb per second for the latest RAM chips.
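The "around 15p per Megabyte" figure follows directly from the 2004 prices quoted above; a quick check:

```python
# Price per megabyte for the memory prices quoted in the text (2004).
def pence_per_mb(price_pounds, capacity_mb):
    return price_pounds * 100 / capacity_mb

print(round(pence_per_mb(40, 256), 1))    # 15.6p/Mb for the 256 Mb module
print(round(pence_per_mb(70, 512), 1))    # 13.7p/Mb for the 512 Mb module
print(round(pence_per_mb(150, 1024), 1))  # 14.6p/Mb for the 1 Gb module
```

All three work out close to the 15p per Megabyte quoted in the text.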

1.4.1.3 Cache Memory

The speeds quoted above for data access to main memory sound quite impressive. However, current processors are able to process data even faster than that! Unfortunately, they may be held up waiting for the transfer of data to or from main memory. One solution to this problem would be to increase the number of registers on the microprocessor itself, so that all data required would be instantaneously available to the processor. However, this solution is impractical, as it would lead to overly complex and large microprocessor chips.


As a practical alternative, computer designers have developed a type of memory known as cache. Cache is simply very fast memory chips located on the motherboard very close to the processor. The system tries to ensure that data and instructions required immediately by the processor are transferred from main memory to the cache, from which the processor can access them quickly.

What do we mean by quickly? Currently, cache memory has access times of 10-50ns, with data transfer rates to the processor of over 10Gb per second. This is considerably faster than even the fastest main memory currently available, but comes at a price. At the time of writing, a 256Mb cache upgrade would cost around £200, which is approximately 80p per Megabyte, significantly more than ordinary main memory. As a result, designers have to make compromises about the amount of cache provided. As we shall see, there are two main types of cache, but we will study these, and how cache operates, in a later topic.
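The benefit of a small cache can be estimated with the standard average-access-time formula. This formula is not given in the study guide, the 95% hit rate below is purely illustrative, and the access times are picked from the ranges quoted in the text:

```python
# Average memory access time: most accesses hit the fast cache,
# a few fall through to slower main memory.
def average_access_ns(hit_rate, cache_ns, memory_ns):
    return hit_rate * cache_ns + (1 - hit_rate) * memory_ns

# Assumed figures: cache ~20ns, main memory ~80ns, 95% hit rate.
print(round(average_access_ns(0.95, 20, 80), 1))   # 23.0 ns
```

With a high hit rate, the effective access time stays close to cache speed even though most data lives in main memory, which is why a relatively small amount of expensive cache is such a good compromise.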

1.4.2 External Memory

All three memory types studied so far are internal to the computer system. This allows fast access, typically measured in nanoseconds, with high rates of data transfer, typically more than 1Gb per second. However, this type of data storage cannot be used as backing storage, as the data is lost if the computer is switched off. Backing storage is provided by external memory devices. There is a wide range of devices available, and the technology is continually improving. We will consider the following types:

• magnetic discs

• optical discs

• magnetic tape

1.4.2.1 Magnetic Disc

You should already be familiar with magnetic discs, their features and their uses. Floppy discs were the first type of removable magnetic disc. The early ones were large (8 inches in diameter) and very floppy, hence the name. During the 1980s and 1990s, the technology improved, and floppy discs grew smaller in size, down to the standard 3.5 inch disc, with a storage capacity of 1.4Mb. These were still floppy discs, although they were protected by a hard plastic case, with a metal slider which only opened when slotted into a disc drive.

With increasing file sizes, especially as a result of multimedia developments, the 1.4Mb capacity made the floppy disc less and less useful. As a result, many hardware manufacturers, following the lead set by Apple in its iMac computers, have dropped floppy disc drives from their models.

The big brother of the floppy disc, the hard disc, still continues to play an important role as the main backing storage device in most computer systems. Early microcomputers boasted hard discs with a capacity of 20Mb. Over the last two decades, hard discs have grown massively in capacity, while the prices have dropped. As a result, current desktop microcomputer systems typically have hard discs of 160Gb or more (an increase by a factor of 8000 in 20 years!). Removable hard discs are also available, which are not much bigger than the early floppy discs.

Figure 1.6: Inside of a hard disc unit

At the time of writing, a typical 400Gb removable hard disc could be purchased for around £200, which is around 0.05p per Megabyte, much cheaper than RAM.

The main disadvantage of hard disc systems is that they are much slower than RAM. Access time could be as much as 10 ms (10,000,000 nanoseconds), and the data transfer rate is typically in the range 30-100Mb per second. This is much too slow for a typical processor, so hard disc is only suitable as backing storage. Currently active programs and associated data must be loaded into RAM for faster access.
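To see just how wide the speed gap between main memory and hard disc is, compare the access times in the same units (figures taken from the text; the 50ns main-memory figure is a mid-range assumption):

```python
# Main memory vs hard disc access time, both in nanoseconds.
ram_access_ns = 50                 # mid-range main memory figure
disc_access_ns = 10 * 1_000_000    # 10 ms expressed in nanoseconds

print(disc_access_ns // ram_access_ns)   # 200000: the disc is ~200,000x slower
```

A factor of hundreds of thousands is why the processor never reads data directly from disc: active programs and data are staged into RAM first.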

Investigating a Hard Disc specification (20 min)

Find out the specification (price and capacity) of a currently available hard disc. If possible, find out its data transfer rate in Mb per second. Calculate the cost per Megabyte.

1.4.2.2 Optical Disc

Optical storage technology has developed rapidly since the early 1990s, and looks likely to replace magnetic technology for many applications.

The most common optical storage device is the CD-ROM (Compact Disk Read-Only Memory). It is a read-only medium whose contents cannot be altered once data is written to it. During the mastering of a CD, a reasonably high powered laser is used to burn pits, typically 0.5 microns wide, 0.83 to 3 microns long and 0.15 microns deep, in a spiral, outwards from the centre of the disk. The disk is then coated with a reflective layer of aluminium. If the spiral were straightened out, the data would stretch for four miles!

The CD player uses a lower powered laser to track the spiral as the disk turns. A spot of infrared light is shone onto the spiral to detect the data stored there. Pits on the disk surface encode digital data and signal the path of the spiral. This is illustrated in Figure 1.7.


[Figure 1.7 shows: a cross-section of a compact disc. Pits form a spiral track starting near the centre, read by a laser and sensor through the 1.2mm transparent plastic layer; above this lie the reflective film, a protective coating and the label. Dimensions marked include the 120mm disc diameter, the 15mm centre hole, the track pitch, the space between tracks and the pit width.]

Figure 1.7:


Recordable CDs (CD-Rs)

The read-only limitation of CD-ROMs has been overcome by the creation of writeable CDs. The CD-R drive operates its laser at one of 3 different power levels. At low power it detects the presence or absence of pits, while at the highest power it can burn data onto the surface.

Rewritable CDs (CD-RW)

The next stage in CD development was the introduction of CDs that can erase current data and store new data. These operate using the highest power of laser to melt a small region of the recording layer. After freezing, the middle power laser is used to warm the surface to a temperature that is less than melting point but high enough to create the highly reflective crystalline form. The highest power is then used to write the data.

Capacity

CD-ROMs are typically 650 Mbytes in capacity. However, the future use of blue lasers, instead of red laser light, will deliver increased CD-ROM storage capacity. This is due to the shorter wavelength of blue light compared to the red light used in CD-ROM drives. Using a shorter wavelength results in smaller pits, allowing higher pit densities and more data storage. In later generations of CD-ROM we could see the capacity increase four-fold, from a mere 650 Mbytes to around 3 Gbytes.

Speed

In a single speed CD reader, data can be read at a rate of 150Kbytes/sec. This is sufficient for audio but is very slow for motion video or large image files. Multiple speed CD readers, such as 8xCD-ROM or 24xCD-ROM, can read data at a rate of 1.2 Mbytes/sec and 3.6 Mbytes/sec respectively. Notice that these are multiples of the single-speed rate. Although 48xCD-ROMs are available, at data rates of 7.2 Mbytes/sec, they pale in comparison to SCSI hard disk drive rates.

Access

CD-ROM access is direct.

Comparing optical discs with hard discs

Optical storage has many advantages over magnetic storage, but current devices tend to be limited in capacity and have slower data transfer rates than hard discs. For example, the maximum capacity of a CD-ROM or CD-RW is 700Mb. DVD discs can store up to 4.7 or 9.4Gb, which is much more than a CD (enough to store 133 minutes of video on one side), but much less than a large hard disc.

Optical storage devices tend to have lower data transfer rates than hard discs, although speeds are increasing all the time. The first CDs could transfer data at a rate of 150Kb per second, and specifications are given as a multiple of that. At the time of writing, the fastest CD drives were rated as 52X, which means 52 x 150Kbps, or 7.8Mbps, significantly slower than hard disc data transfer rates. DVD data transfer rates are quoted as multiples of 1.3Mb per second, so a 16X DVD has a data transfer rate of 16 x 1.3Mbps, which is about 20.8Mbps. This is closer to hard disc speeds, but still slower.

Note that most manufacturers quote 3 speeds for a device. For example, a CD-RW drive might be rated as 32X16X40X. This means it can

© HERIOT-WATT UNIVERSITY 2006

• record data (for the first time) at 32 x 150Kbps

• rewrite data at 16 x 150Kbps

• read data (transfer it to the processor) at 40 x 150Kbps.
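The three speed ratings above are all simple multiples of the 150 Kbytes/sec single-speed rate, so converting a rating to an actual transfer rate is a one-line calculation. A minimal Python sketch (the function name is ours, invented for illustration):

```python
# Single-speed CD transfer rate, in Kbytes per second
BASE_RATE_KBPS = 150

def cd_rate_mbytes_per_sec(multiple):
    """Convert an Nx CD speed rating to Mbytes/sec (1000 Kbytes = 1 Mbyte here,
    matching the convention used in the text: 8x = 1.2 Mbytes/sec)."""
    return multiple * BASE_RATE_KBPS / 1000

for rating in (1, 8, 24, 48):
    print(f"{rating}x: {cd_rate_mbytes_per_sec(rating):.2f} Mbytes/sec")
```

Applied to the 32X16X40X CD-RW drive above, this gives writing at 4.8 Mbytes/sec, rewriting at 2.4 Mbytes/sec and reading at 6.0 Mbytes/sec.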

Investigating Optical Storage Devices (30 min)
Find out the current cost of a single CD-R, CD-RW, DVD-R and DVD-RAM disc. Express this as a cost per Megabyte. Find out and calculate the data transfer rate of current CD and DVD drives.

1.4.2.3 Magnetic tape

You are probably less familiar with magnetic tape as a backing storage method, as it is mainly used on large commercial systems or on network servers rather than on personal microcomputers. There are many types of tape system available. Most record data on a helical track, on tape which may be 4mm, 8mm or a quarter of an inch wide.

Figure 1.8:


Here are some of the types currently available, with typical capacities and data transfer rates:

Type of tape              Capacity    Maximum data transfer rate
DAT                       40Gb        5Mb/s
QIC (DC 600 or DC 2000)   Up to 2Gb   1.6Mb/s
8mm                       14Gb        2Mb/s
DLT                       70Gb        2Mb/s
Mammoth                   40Gb        6Mb/s
AIT                       50Gb        6Mb/s

Investigating Magnetic Tape (20 min)
Choose any one of the tape systems listed in the table above. Find out the current capacity, maximum data transfer rate and cost. Calculate the cost per Megabyte.

Uses of magnetic tape
You will notice that magnetic tape is probably the cheapest form of data storage, and also the one with the lowest data transfer rate. For these reasons, it is not used for the day to day running of systems, but it is ideal for making and keeping long term backups of commercial data, or backup copies of network hard drives.

1.4.3 A memory hierarchy
Throughout the last topic, you have gathered a large amount of information about the various types of memory, both internal and external. It is useful to analyse this information, to see how these forms of storage compare with each other.

Constructing a comparison table of storage devices (30 min)
Copy and complete this table, using the information provided or researched in the previous topics:

                         Cost per Mbyte   Capacity   Access time   Data transfer rate
Internal memory types
  Registers
  Cache
  RAM
External memory types
  Hard Disc
  CD-RW
  Rewritable DVD
  Magnetic tape


Q6: How does cost per Mbyte change as you go down the table?

Q7: How does capacity change as you go down the table?

Q8: How does access time change as you go down the table?

Q9: How does data transfer rate change as you go down the table?

To think about...
Are the data storage types listed in the "correct" order in the table above? Does it depend on which column you use as the criterion? Because it is possible to list data storage types in this way, the list is often referred to as a memory hierarchy. Going down the list, the capacity increases, the access time increases, the data transfer rate decreases, and the cost per Megabyte decreases.

Question
Here is some information about CompactFlash (solid state) memory:

A 512Mb CF card currently costs £40. The data transfer rate is just over 50Mb per second, with an access time of around 10 milliseconds. Where should this go in the memory hierarchy?

1.4.4 Memory and performance
From earlier studies, you should know that system performance depends on a number of factors. These include

• Clock speed

• Data bus width

• Type of processor

Memory also has an impact on system performance. Consider the following situation...

A computer user has received a new piece of software on a magazine cover CD. To test the software, the user tries running the new application directly from the CD. It is slow to load and start running, because a CD-ROM has a long access time and a relatively low data transfer rate. To improve performance, the user installs the software on to his hard disc. Once installed, performance is much better, as the data transfer rate from hard disc is much faster.

However, he notices that sometimes there is a pause during the running of the software, and the hard disc can be heard running. This may be because there is not enough main memory (RAM) to hold the whole program and its related data, so that sometimes data has to be fetched from the hard disc into RAM when required. The user likes the new software, but finds the constant pauses while it is running a great nuisance. He buys some extra RAM, and plugs it into the motherboard. Now there is enough RAM for the whole program and its associated data to be held in RAM


at the same time, so there are no more pauses while further data is fetched from the hard disc.

He notices that his friend is also using the same software, but that it seems to run faster on his friend’s computer. They compare the computer specifications: both have the same type and speed of processor, and the same total amount of RAM. Why the difference in performance? They read the specifications carefully. The only difference is that his friend’s computer has more cache memory. This may account for the improved performance, as more of the data can be accessed quickly from cache rather than from RAM...

The description above is a little simplistic. There are many other factors which can affect computer performance, and they all interact in subtle ways, depending on the type of application being used at the time. However, the amount and type of memory is one of the important factors that can influence performance. Basically, to improve performance, designers will attempt to

• develop processors with more internal registers

• increase the amount of cache available to the processor

• increase the amount of main memory (RAM)

• use faster external storage devices where possible

However, all of these increase costs, so the final design of any system will be a compromise between cost and performance.

Q10: Arrange these types of memory in order of decreasing price per megabyte: DVD, RAM, Hard Disc, CD, Cache, Tape, Register

Q11: Would the order change if they were listed in order of decreasing data transfer rate?

1.5 Why binary?
All the logic circuits used in digital computers are based upon two-state logic. That is, quantities can only take one of two values, typically 0 or 1. These quantities are represented internally by voltages on lines, zero voltage representing 0 and the operating voltage of the device representing 1. Two-state logic is used because such devices are easy and economical to produce.

The computer represents all information internally, both instructions and data, by patterns of 0s and 1s. Each individual 0 or 1 is held in a bit, and memory and registers within the computer are implemented as collections of bits. In memory the smallest addressable quantity is the byte, which is eight bits. While individual bits are often interpreted as truth values (true or false), collections of bits are used to represent numbers and characters. Numbers are represented by some version of the binary number system, which has only the two digits 0 and 1. Characters are represented by specific patterns


of 0s and 1s in a byte, using some encoding such as ASCII (American Standard Code for Information Interchange) or in two bytes using the Unicode standard.

1.5.1 Review questions 1

Q12: Why do computers use a binary representation to store information?

Q13: How would a character in the ASCII character set be represented inside a computer?

1.6 The binary number system
The binary system is an arithmetic system using a base of 2, unlike the decimal system which uses the base 10. It is a positional number system, similar to the decimal system, where the value of a digit depends upon its position in the number.

For example, the decimal number 121 represents

1×10² + 2×10¹ + 1×10⁰

and the 1 represents the value 100 in one position and represents 1 in another. The same number in binary form would be 1111001, which represents

1×2⁶ + 1×2⁵ + 1×2⁴ + 1×2³ + 0×2² + 0×2¹ + 1×2⁰

Fractions are represented in similar fashion using negative powers of 10 (decimal) or 2 (binary). For example:

13.15₁₀ = 1×10¹ + 3×10⁰ + 1×10⁻¹ + 5×10⁻²

and

101.11₂ = 1×2² + 0×2¹ + 1×2⁰ + 1×2⁻¹ + 1×2⁻² = 5.75₁₀

where the subscript denotes the base in which the number is expressed. The decimal number system requires the basic ten digits 0, 1, ..., 9 to express any number, while the binary system only requires the two digits 0 and 1.

1.6.1 Review questions 2

Q14: Convert the binary integer 1001 to decimal.

Q15: Convert the binary integer 0110 to decimal.

Q16: Convert the binary integer 1101 to decimal.

Q17: Express the decimal integer 5 as a 4-bit binary number.

Q18: Express the decimal integer 87 as an 8-bit binary number.


Q19: How would the decimal fraction 3/4 be expressed in binary?

Q20: What is the largest integer expressible in 4 bits? Give your solution in decimal.

Q21: Add 1 to the number you wrote down in the last question. What is this number expressed as a power of 2? Now you should be able to write down the largest number that can be held in 4 bits in terms of a power of 2. Can you generalise this to the largest number representable in n bits?
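The pattern behind Q21 can be checked with a short Python sketch (ours, for illustration): setting all n bits to 1 gives 2^n - 1, and adding 1 gives exactly 2^n.

```python
def largest_in_n_bits(n):
    """All n bits set to 1 gives 2**n - 1, the largest unsigned value."""
    return 2 ** n - 1

print(largest_in_n_bits(4))      # 15, i.e. binary 1111
print(largest_in_n_bits(4) + 1)  # 16, i.e. 2 to the power 4
print(largest_in_n_bits(8))      # 255, the largest value in one byte
```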

1.7 Conversion between number systems
Conversion from a binary number to a decimal is straightforward: all that is required is to multiply each binary digit by the appropriate power of two and add the results. This method is easily extended to convert numbers expressed in a base b number system into decimal.

For example

11011₂ = 1×2⁴ + 1×2³ + 0×2² + 1×2¹ + 1×2⁰ = 27₁₀

Converting from Binary to Decimal (10 min)
On the Web is an animation which shows you how a number is converted from binary to decimal.

Converting from decimal to binary is less straightforward: integer and fractional parts have to be treated differently. Converting a decimal integer to binary requires successive divisions of the integer by two, keeping note of the remainders, until the result is zero. These remainders are the binary digits representing the integer, least significant digit first. This process is illustrated as follows, which produces the binary number equivalent to the decimal integer 79.

     divide by 2   remainder
79        39           1
39        19           1
19         9           1
 9         4           1
 4         2           0
 2         1           0
 1         0           1

showing that the binary equivalent of 79₁₀ is 1001111₂ (reading the remainders from the bottom up). Note that this process always terminates, because every decimal integer can be represented exactly by a finite length binary integer.
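Both directions of conversion described above can be written out in a few lines of Python. This sketch (function names are ours) mirrors the two methods exactly: multiply-and-add for binary to decimal, and repeated division by 2 for decimal to binary.

```python
def binary_to_decimal(bits):
    """Multiply each binary digit by its power of two and add the results."""
    total = 0
    for digit in bits:
        total = total * 2 + int(digit)  # shift left and add the next digit
    return total

def decimal_to_binary(n):
    """Divide repeatedly by 2; the remainders are the bits, least significant first."""
    if n == 0:
        return "0"
    bits = []
    while n > 0:
        bits.append(str(n % 2))  # the remainder is the next binary digit
        n //= 2
    return "".join(reversed(bits))  # remainders are collected LSB first

print(binary_to_decimal("1001111"))  # 79
print(decimal_to_binary(79))         # 1001111
```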


Decimal to Binary Conversion On the Web is an interaction which shows the conversion of a decimal number to binary. You should now look at this interaction.

1.7.1 Review questions 3

Q22: Convert the binary number 1101 to decimal.

Q23: Convert the binary number 110101 to decimal.

Q24: Convert the decimal number 19 to binary.

Q25: Convert the decimal number 81 to binary.

1.8 Hexadecimal Numbers
It is very inconvenient for anyone who has to assign particular bit-patterns as values inside the computer to use binary notation. They could use the equivalent decimal value, but then there is the difficulty of converting between binary and decimal. For this reason hexadecimal notation is often used, because conversion between binary and hexadecimal is easier. Hexadecimal is a number system with a base of sixteen, so sixteen hexadecimal digits are required. These digits are defined below, the first ten being the normal decimal digits 0 to 9 and the remainder being the letters A to F. Their binary equivalents are also

given.

0000 = 0    1000 = 8
0001 = 1    1001 = 9
0010 = 2    1010 = A
0011 = 3    1011 = B
0100 = 4    1100 = C
0101 = 5    1101 = D
0110 = 6    1110 = E
0111 = 7    1111 = F

A hexadecimal integer can easily be converted to decimal by straight expansion, as in the method used for binary to decimal. For example:

24₁₆ = 2×16¹ + 4×16⁰ = 36₁₀

2C₁₆ = 2×16¹ + 12×16⁰ = 44₁₀


Decimal integers can be converted to hexadecimal by using the division method used for converting decimal to binary, but dividing by 16 rather than 2. For example, to convert 173₁₀ to hexadecimal:

      divide by 16   remainder
173        10         13 (D)
 10         0         10 (A)

Hence 173₁₀ is AD₁₆.

Converting hexadecimal to binary, or vice-versa, is very easy. In converting hexadecimal to binary, the hexadecimal digits are merely replaced by their four-bit binary equivalents. In converting binary to hexadecimal, the binary number is split into 4-bit sections from the least significant digit, and each 4-bit section is converted to its hexadecimal equivalent. Normally, of course, the word lengths used in computers are multiples of four; in fact they are usually 1, 2, 4 or 8 bytes long. For example:

1011 1100 1001 0011₂ = BC93₁₆

1DE9₁₆ = 0001 1101 1110 1001₂

If a large decimal integer has to be converted to binary then it is easier to first convert it to hexadecimal and then to binary. For example, to convert 7893 to binary:

       divide by 16   remainder
7893       493          5
 493        30         13 (D)
  30         1         14 (E)
   1         0          1

Hence

7893₁₀ = 1ED5₁₆ = 0001 1110 1101 0101₂
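The four-bits-per-hex-digit correspondence makes these conversions entirely mechanical, as this Python sketch shows (function names are ours, for illustration):

```python
def binary_to_hex(bits):
    """Split the binary string into 4-bit groups from the right,
    then replace each group with its hexadecimal digit."""
    bits = bits.zfill((len(bits) + 3) // 4 * 4)  # pad on the left to a multiple of 4
    groups = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "".join("0123456789ABCDEF"[int(g, 2)] for g in groups)

def hex_to_binary(hex_digits):
    """Replace each hexadecimal digit with its 4-bit binary equivalent."""
    return " ".join(format(int(d, 16), "04b") for d in hex_digits)

print(binary_to_hex("1011110010010011"))  # BC93
print(hex_to_binary("1DE9"))              # 0001 1101 1110 1001
```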

Advantages of Hexadecimal (10 min)
On the Web is an activity that asks you to identify the advantages of hexadecimal notation. You should now do this activity.

1.8.1 Review questions

Q26: Convert the binary integer 11101011 to hexadecimal.

Q27: Convert the binary integer 01011111 to hexadecimal.

Q28: Convert the hexadecimal integer 2A to binary and hence give its decimal value.

Q29: Convert the hexadecimal integer CD to binary and hence give its decimal value.


1.9 Summary

• any computer system can be represented as a 5-box diagram - processor, main memory, backing storage, input and output devices

• the processor, main memory, peripherals and backing storage devices are connected by the data bus and address bus

• memory can be arranged into a hierarchy of decreasing cost per Megabyte, increasing access time and increasing capacity in this order: registers, cache, RAM, hard disc, optical devices, magnetic tape

• any whole number can be represented in binary as a series of 1s and 0s (bits)

• all data within a computer system is represented and stored in binary number format

• any group of 4 bits can be represented as a single hexadecimal digit (0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F)

• hexadecimal is NOT used by computers, but is a useful way for humans to write binary numbers

1.10 Revision Questions

Q30: The system bus is

a) a group of wires which carry data between the processor, memory and peripheral devices
b) a group of wires which carry data within the processor
c) a single wire that carries data between the processor and memory
d) a cable used to connect a computer to a peripheral device

Q31: Where does cache fit in the memory hierarchy?

a) above registers
b) between RAM and hard disc
c) between registers and RAM
d) between magnetic tape and magnetic disc

Q32: As you go up the memory hierarchy from magnetic tape towards registers

a) storage capacity increases
b) cost per Megabyte decreases
c) access time increases
d) data transfer rate increases

Q33: The decimal number 27 could be represented by the binary number

a) 1B
b) 1101100
c) 0011011
d) 0011010


Q34: The binary number 00101100 would be represented in hexadecimal by

a) C2
b) 2C
c) 0230
d) 212

Q35: The hexadecimal digit F represents

a) 1111 in binary, which is 15 in decimal
b) 10000 in binary, which is 16 in decimal
c) the ASCII code for the letter F
d) the number four

Q36: Hexadecimal is

a) a base 16 code used to write binary numbers in a shorthand way
b) a base 16 code used by some computers rather than binary
c) a base 6 code used by computers to represent text characters
d) a code used in processors to represent machine code instructions


Topic 2

Processor Structure

Contents

2.1 Prior Knowledge and Revision ...... 28
2.2 Introduction ...... 29
2.3 The Processor ...... 29
2.3.1 Basic Processor Architecture ...... 29
2.3.2 Processor Registers ...... 30
2.3.3 The fetch-execute cycle ...... 31
2.3.4 Reviewing the Fetch-Execute cycle ...... 33
2.4 Improving performance ...... 33
2.4.1 Increasing clock speed ...... 34
2.4.2 Increasing data bus width ...... 35
2.4.3 PCI and PCI-X buses ...... 36
2.4.4 Use of cache memory ...... 37
2.4.5 Memory interleaving ...... 39
2.4.6 Direct Memory Access (DMA) ...... 40
2.5 Summary ...... 41
2.6 Revision Questions ...... 42

Learning Objectives
After studying this topic, you should be able to:

• describe the main registers to be found in a processor

• describe the relationship between these registers and the data and address buses

• describe the fetch-execute cycle in terms of buses and registers

• explain how processor performance can be improved by
  – increasing clock speed
  – increasing data bus width
  – use of cache memory
  – memory interleaving
  – direct memory access

• describe the development of PCI and PCI-X buses

2.1 Prior Knowledge and Revision
Before studying this topic, you should be able to

• describe the function of each of the following parts of a computer system
  – processor, ALU, control unit and registers
  – data bus, address bus and control lines

• list the following named registers to be found in typical microprocessors
  – memory address register (MAR)
  – memory data/buffer register (MDR/MBR)
  – program counter (PC)
  – instruction register (IR)

Revision

Q1: Registers are

a) memory locations within main memory
b) data storage locations within a processor
c) lists of files stored on a backing storage device
d) data lines linking devices within a computer system

Q2: Which processor component manages the execution of instructions?

a) the arithmetic and logic unit (ALU)
b) the instruction register (IR)
c) the control unit
d) the data bus

Q3: The address bus is used

a) to transfer data to the processor from memory
b) to send read and write instructions to memory
c) to send addresses from processor to memory
d) to send addresses from memory to processor

Q4: The memory buffer register (MBR) is also known as

a) the memory address register (MAR)
b) the instruction register (IR)
c) the program counter (PC)
d) the memory data register (MDR)

Q5: Which of these statements is correct?

a) the ALU is part of the control unit
b) the control unit is part of the ALU
c) the control unit consists of an ALU and registers
d) the ALU, control unit and registers are parts of the processor


2.2 Introduction
In the first topic, we studied the basic architecture of a typical computer system, reviewing the main sub-systems within it - the processor, main memory, cache memory and external storage devices. In this topic, we will take a closer look at the processor, and how it communicates with other devices.

2.3 The Processor

2.3.1 Basic Processor Architecture
We will now study the internal architecture of the microprocessor itself. Because of the stored program concept, any consideration of this architecture must consider the relationship between the processor and memory. Figure 2.1 is a schematic diagram of a fairly typical microprocessor design, showing the internal structure of the processor and its relationship to the Random Access Memory of the computer.

[Figure: the control unit, program counter (PC), MAR, IR, MDR, general purpose (GP) registers and ALU, linked by the internal processor bus; the MAR feeds the address bus and the MDR the data bus, both of which connect to memory. Data paths and control paths are shown separately.]

Figure 2.1: Typical microprocessor design

In Figure 2.1, you can see the main components of the processor. These are

• the control unit

• the ALU

• registers

• the internal processor bus

Q6: Which processor component manages the execution of instructions?


Q7: Which processor component processes and manipulates data?

Q8: Which processor component stores data and instructions temporarily?

Q9: Which processor component transfers data and instructions within the processor?

2.3.2 Processor Registers
Within any processor there are many registers. The actual number depends on the processor design: some of the simplest, earliest processors had only a few registers, while more modern designs may have hundreds. Some of these registers have a particular purpose, and are to be found in almost every design of processor.

The memory data register (MDR) is connected to the data bus, so any data transferred to the processor from anywhere else in the system arrives in the MDR. It is like the arrival platform of a railway station. Once safely stored in the MDR, the data can then be moved along the internal bus to anywhere else within the processor. The MDR is also the departure platform: any data which is to be sent out from the processor is sent out from the MDR along the data bus. The MDR forms a tiny buffer between the internal bus and the data bus, so it is also known as the memory buffer register (MBR). Why memory? The reason is that most data transfers are either to or from memory.

When data is being sent to memory, or fetched from memory, the processor must ensure that the data is sent to, or fetched from, the correct memory location. It does so by sending the address out along the address bus. The memory address register (MAR) is the departure platform for any address to be sent out along the address bus. Note that addresses never arrive in the MAR from outside the processor: the address bus is a one-way bus, unlike the data bus, which can carry data in either direction.

Very often, the data being fetched from memory will be a machine code instruction. This arrives in the MDR like any other item of data. It is then transferred by the internal bus to the instruction register (IR), where it is held while being decoded by the control unit. At any instant, the IR holds the machine code instruction which is currently being decoded and executed.
A machine code program consists of a series of machine code instructions, held in main memory. These are fetched one by one from memory. How does the processor know where to find the next instruction to be processed? It holds the address of the next instruction in a special register called the program counter (PC). These 4 registers, or their equivalents, are to be found in every type of processor. In addition, there will be many general purpose registers (GP registers), which, as their name implies, can be used to store any item of data at any time as required by the current program running in the processor. You can see all of these registers in the schematic diagram of a typical microprocessor (Figure 2.1).

Q10: Which register holds the instruction currently being decoded and executed?

Q11: Which register holds the address of the next instruction to be fetched?


Q12: Which register can hold any item of data required?

Q13: Which register holds data about to be sent out along the data bus?

Q14: Which register holds the address of a memory location about to be accessed?

2.3.3 The fetch-execute cycle
To execute a machine code program, it must first be loaded, together with any data that it needs, into main memory (RAM). Once loaded, it is accessible to the CPU, which fetches one instruction at a time, decodes and executes it. Fetch, decode and execute are repeated until a program instruction to HALT is encountered. This is known as the fetch-execute cycle.

2.3.3.1 Fetch execute cycle in greater detail

Using the fetch-execute cycle, machine code instructions are repeatedly transferred from main memory to the CPU for execution. Now we will study how the address bus, data bus, control bus and internal registers take part in reading a program instruction from main memory - essentially the fetch phase of the fetch-execute cycle. Figure 2.2 below illustrates in more detail the fetch-execute cycle.

Figure 2.2: Fetch-execute cycle


To execute a program, the start address of the program in memory is loaded into the Program Counter. The succeeding steps are:

1. Fetch. The instruction is fetched from the memory location whose address is contained in the Program Counter and placed in the Instruction Register. The instruction will consist of an operation code and possibly some operands. The operation code determines which operation is carried out. The term opcode is usually used as a shorthand for operation code.

2. Decode. The pattern of the opcode is interpreted by the Control Unit and the appropriate actions are taken by the electronic circuitry of the CU. These actions may include the fetching of operands for the instruction from memory or from the general purpose registers.

3. Increment. The Program Counter is incremented. The size of the increment will depend upon the length of the instruction in the IR. For example, if this instruction was a 3 byte instruction then the PC would be incremented by 3.

4. Execute. The instruction in the Instruction Register is executed. This may lead to operands being sent to the ALU for arithmetic processing and the return of the result to one of the general purpose registers.

When a HALT instruction is received then execution of this program ceases.

2.3.3.2 The fetch phase

The first of the two main phases of the fetch-execute cycle is the fetch phase, and consists of the following steps:

1. The contents of the PC are copied into the MAR;

2. The contents of memory at the location designated by the MAR are copied into the MDR;

3. The PC is incremented;

4. The contents of the MDR are copied into the IR.

Remember that the PC is used to keep track of where execution has reached. Thus the first step is concerned with establishing the location of the next instruction to execute. The second step is to get that instruction into the MDR. The third step is to ensure that the PC points to the next instruction to be executed: if we did not increment the PC at some point, we would continually execute the same instruction over and over again! The fourth step ensures that there is a copy of the instruction in the IR ready for execution to begin.


2.3.3.3 The execute phase

The execute phase consists of the following steps:

1. Decode the instruction in the IR; 2. Execute the instruction in the IR.

Once the execute phase has completed, the fetch phase will be carried out again.

Pseudocode representation of the fetch-execute cycle

For convenience we can write this series of steps as a pseudocode representation:

loop forever
    PC -> MAR
    [MAR] -> MDR
    PC + 1 -> PC
    MDR -> IR
    decode IR
    execute IR
end loop

Note that -> means "is copied to" and that [MAR] means the contents of the memory location pointed to by MAR.
Use of Registers in an Instruction Fetch
On the Web is an animation which shows you how the internal registers and buses are used during the fetch of an instruction from main memory. You should now look at this animation.

2.3.4 Reviewing the Fetch-Execute cycle

Sequencing the steps in an instruction fetch
On the Web is an assessment that requires you to place the steps of an instruction fetch in the correct order. You should now carry out this assessment.

Q15: What are the main steps in the fetch-execute cycle?

Q16: At what stages in the fetch-execute cycle are the Memory Address Register and the Memory Data Register used?

Q17: What would happen if a program contained no HALT instruction?

2.4 Improving performance
Computer and microprocessor designers are driven by the need to improve computer performance to meet the ever increasing demands of computer users. When the first


computers were developed, programs were relatively short and simple, and were used to manipulate text and numbers only. The advent of graphical user interfaces (GUIs) in the 1980s required processors to handle much larger amounts of data at a speed acceptable to the user. Since then, even desktop microcomputers have been developed that can handle high resolution graphics, sound and video in real time. In this topic, we will study some of the techniques used to improve the performance of computer systems.

2.4.1 Increasing clock speed
One of the prime factors affecting the performance of microprocessors is the clock speed at which they run. Every processor has an internal clock which generates pulses at a regular rate. These pulses are used to synchronise the steps carried out by the microprocessor during the fetch-execute cycle. All processor activities will start on a clock pulse, for example fetching an instruction, placing data in the Memory Data Register, or transferring an operand from a general-purpose register to the ALU. The time between pulses is the cycle time.

Early microprocessors had clock speeds measured in kHz (thousands of cycles per second) while modern processors such as the Pentium III are now achieving speeds of over 1 GHz (a thousand million cycles per second). Obviously clock speed is an important factor in determining the performance of a microprocessor: one running at 200 MHz is likely to execute instructions faster than one which runs at 100 MHz. However, when it comes to judging performance between competing processors, clock speed may not always be a reliable measure of relative performance. One of the reasons for this is that an instruction such as an add may take several cycles, the number of cycles required increasing with the complexity of the addressing method used for the operands. Table 2.1 tabulates the clock speed versus the performance of Intel processors as measured in Million Instructions per Second (MIPS). MIPS is now an outdated way to measure performance, but it is the only measure applicable over the whole range.

Table 2.1: Clock speed versus performance

Intel Processor   Clock Speed   MIPS
8086              8 MHz         0.8
80286             12.5 MHz      2.7
80386DX           20 MHz        6.0
80486DX           25 MHz        20
Pentium           60 MHz        100
Pentium Pro       200 MHz       440

This table shows that performance as measured by MIPS has gone up at a higher rate than the clock rate. For example, on these figures the average number of clock cycles required for an instruction in the Intel 8086 processor was about 10, while in the Pentium Pro about two instructions per clock cycle are achieved. We shall look later at why this has happened.
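The cycles-per-instruction comparison above is simple arithmetic on the table: clock speed in MHz divided by MIPS gives the average number of clock cycles per instruction. A quick Python check (using the two table rows discussed in the text):

```python
# Clock speed (MHz) and MIPS figures, taken from Table 2.1
processors = {
    "8086":        (8.0, 0.8),
    "Pentium Pro": (200.0, 440.0),
}

for name, (mhz, mips) in processors.items():
    # million cycles/sec divided by million instructions/sec
    cycles_per_instruction = mhz / mips
    print(f"{name}: about {cycles_per_instruction:.2f} cycles per instruction")
```

For the Pentium Pro the result is below 1 (about 0.45), which is another way of saying it completes roughly two instructions per clock cycle.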


Determining the latest clock speeds (30 min)
Table 2.1 shows the clock speeds and resultant MIPS from the 8086 (of year 1978) through the 80286 (1982), 80386DX (1985), 80486DX (1989), Pentium (1993) to the Pentium Pro (1995). Copy and complete the table for more recent Intel processors, including the Pentium 2 (1997), the Pentium 3 (1999), the Pentium 4 (2000), the Itanium (2001), and the Pentium M (2003). You can find data at http://www.intel.com/pressroom/kits/quckreffam.htm (if you have difficulty accessing this link, go to www.intel.com and enter "quckreffam" in the search box). Use a spreadsheet program to create a graph of clock speed against date. Predict the clock speed which you would expect in 5 years' time.

Processor performance at different clock speeds
On the Web is an animation that demonstrates the difference in performance of two processors operating at different clock speeds. You should now look at this animation.

2.4.2 Increasing data bus width
Another way to increase performance is to increase the data bus width. One of the bottlenecks within any system is the rate at which data can be fetched from memory to the processor (and vice versa). Increasing the clock speed will increase the number of data fetches that can be made per second. Increasing the data bus width will increase the amount of data that can be fetched each time.

The earliest processors and motherboards had a data bus width of only 4 bits, so it took 2 fetches to transfer a byte from memory to the processor. The Intel 8008 processor (1972) used an 8 bit data bus. Clearly the internal registers (particularly the MDR) had to match this. In 1978, the Intel 8086 was developed, which used a 16 bit data bus and set of internal registers. This gave huge improvements in performance, and allowed the development of the first PCs. By 1985, Intel decided to increase the data bus width and internal registers of its processors again, so the 80386 was produced with a 32 bit data bus. 32 bits was the norm for the next 10 years, until the first 64 bit Pentium chip was introduced in 1995. All PC designs since then have made use of 64 bit technology.

The figures above refer to the Intel range of microprocessors, which form the basis of all Windows-based PCs. A similar development has taken place in the Motorola chips which are used in Apple computers, from the early 68000 16-bit architecture through to the current G5 64-bit architecture. When will we see the first 128 bit chip?

Note that the width of the address bus has no direct effect on performance. The width of the address bus determines the maximum memory address, and therefore the maximum amount of main memory that can be addressed. Generally speaking, address bus widths have also increased steadily over the last few decades from 16-bit to 32-bit, and now 64-bit.

© HERIOT-WATT UNIVERSITY 2006 36 TOPIC 2. PROCESSOR STRUCTURE

2.4.3 PCI and PCI-X buses

The earliest computers had a single system bus connecting the processor with the main memory and peripheral interfaces. This system bus operated at the same speed as the processor. Since then, bus technology has developed and become more complex. We have already seen that the data bus width has been stepped up from 8 to 16, 32 and now 64 bits wide. The number of different components within a system has also increased. A modern processor is likely to be connected to a range of peripherals as well as main memory.

Generally speaking, these peripherals operate at lower speeds than the processor and main memory. As a result, designs have developed with multiple buses within the system - a very fast "frontside" system bus for main memory, and a slower bus for communication with peripheral devices. There have been a number of peripheral buses. The main ones are listed in Table 2.2:

Table 2.2:

Bus type   Width    Speed     Data transfer rate
ISA        16 bit   8 MHz     16 MBps
EISA       32 bit   8 MHz     32 MBps
VL-bus     32 bit   25 MHz    100 MBps
PCI        32 bit   33 MHz    132 MBps
PCI        64 bit   33 MHz    264 MBps
PCI-X      64 bit   133 MHz   1 GBps

The PCI and PCI-X buses are connected to the main system bus by a bus bridge, which controls the traffic on and off the bus. The PCI and PCI-X buses are known as multipoint buses; this means they can have branches to any number of devices. The main system bus is also a multipoint bus. The so-called "backside bus" connecting the processor to cache memory is a point-to-point bus. The relationship between the various components and buses is shown in Figure 2.3.
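The transfer rates in Table 2.2 follow directly from the width and clock speed: bytes moved per cycle times cycles per second. A short Python sketch (ours, for illustration) reproduces the table's figures:

```python
def transfer_rate_mbps(width_bits, clock_mhz):
    """Peak transfer rate in Mbytes per second:
    (bytes per cycle) x (million cycles per second)."""
    return (width_bits // 8) * clock_mhz

buses = [("ISA", 16, 8), ("EISA", 32, 8), ("VL-bus", 32, 25),
         ("PCI", 32, 33), ("PCI (64 bit)", 64, 33), ("PCI-X", 64, 133)]
for name, width, clock in buses:
    print(f"{name:14s} {transfer_rate_mbps(width, clock):5d} MBps")
```

The PCI-X figure comes out as 1064 MBps, which the table rounds to 1 GBps.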


[Figure 2.3: diagram showing the backside bus linking the processor to the Level 2 cache, the frontside bus linking the processor, memory controller and RAM, an AGP chipset connecting the graphics card, and a bus bridge connecting the ISA and PCI buses to peripherals such as the CD drive, hard disc, modem, mouse, USB hub and Ethernet card]

The separation of (relatively slow) peripheral traffic on to the PCI bus means that fast data transfers between main memory and the processor are not slowed down. The PCI-X bus, as well as being faster and wider than the original PCI bus, also has a number of special features to maximise performance. These include the prioritisation of data from different devices, which particularly improves performance of streaming audio and video. You can read more about PCI and PCI-X technology at http://www.extremetech.com/article2/0,1558,1154801,00.asp and http://computer.howstuffworks.com/pci.htm

2.4.4 Use of cache memory

Program instructions are usually read sequentially, so from one instruction it is reasonable to assume that the next instruction required will be in the next memory location. This assumption is used to increase processor efficiency. Although the movement of data within the processor is getting faster and faster, the system buses are not keeping up. This leads to wasted time while the processor waits for data to be fetched from memory.


[Figure 2.4: diagram showing cache memory positioned between the CPU and RAM]

To reduce this problem most machines now have a second, smaller area of memory known as cache memory. This is usually SRAM, which is faster than DRAM; although cache is much smaller than RAM, there is a benefit from the fact that it is always faster to access a small memory segment. When data or an instruction is read from memory, the memory locations following are copied into the cache memory. At the next read instruction the cache memory is read first. If the data is in cache, the access time will be much lower than going to main memory. If the data is not in cache then main memory will be accessed; although there is a slight loss of time from reading twice, the overall time saving from this method is quite significant.

Observing cache memory

On the Web is a simplified simulation of the operations of cache memory. You should now look at this simulation.

Notes on cache memory:

• cache memory uses the faster but more expensive static RAM chips rather than the less expensive, but slower, dynamic RAM chips which are used for most of the main memory

• cache memory is connected to the processor by the "backside" bus

• normally whole blocks of data are transferred from main memory into cache, while single words are transferred along the backside bus from the cache to the processor.

The type of cache described above is now known as level 2 (L2) cache. L2 cache makes significant improvements in system performance, as it reduces the need for the processor to use the system bus to fetch data from main memory. Access to cache is faster because of the shorter distances, the use of faster SRAM, and because the system bus can be used for other purposes at the same time. At the time of writing, many desktop PCs are being sold with 1 MB of L2 cache.
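The behaviour described above can be modelled as a toy cache in Python. This is a simplification of the Web simulation, not part of the course materials: a real cache tracks address tags and uses various replacement policies, but even this sketch shows why sequential reads mostly hit once the surrounding block has been copied in.

```python
class SimpleCache:
    """Toy direct-mapped cache: a miss copies the whole block containing
    the address from 'main memory', so later reads of nearby locations hit."""
    def __init__(self, num_lines=4, block_size=4):
        self.num_lines = num_lines
        self.block_size = block_size
        self.lines = {}                   # cache line -> block number held
        self.hits = self.misses = 0

    def read(self, address):
        block = address // self.block_size
        line = block % self.num_lines
        if self.lines.get(line) == block:
            self.hits += 1                # found in cache: fast access
        else:
            self.misses += 1              # go to main memory, then cache the block
            self.lines[line] = block

cache = SimpleCache()
for address in range(8):                  # sequential reads, as in a running program
    cache.read(address)
print(cache.hits, cache.misses)           # 6 hits, 2 misses
```

Only the first read of each new block goes to main memory; the other reads are satisfied from cache.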


Checking current cache provision (10 min)

Check any computer supplier website to find out how much L2 cache is provided on current desktop and laptop systems.

L1 and L2 cache

Most modern chips also have level 1 (L1) cache. This is similar to L2 cache, but the cache is actually on the same chip as the processor. This means that it is even faster to access than L2 cache. For example, Pentium processors have two L1 caches on the processor. One of these is for caching data, while the other is used for caching instructions (machine code). In the Pentium 4 processor, each of these is 8 Kbytes. Similarly, the PowerPC G4 processor has two 32 Kbyte L1 caches.

2.4.5 Memory interleaving

As we have seen, memory access is one of the major bottlenecks limiting the performance of computer systems. Many techniques have been devised to overcome this, including the use of SRAM, widening the data bus, using separate buses for memory and peripherals, and the use of L1 and L2 cache. Another technique which can be applied is called interleaving.

The idea behind interleaving is that memory can be split up into 2 or 4 independent RAM chips. A memory read or write will normally take 3 or 4 clock cycles to perform, but data is actually being transferred along the data bus during only 1 of these clock cycles. The processor has to insert "wait" states into its program to allow for this. If memory interleaving has been implemented, the processor can use a "wait" state to initiate the next memory access, so saving time. In effect, the processor can access the 4 memory chips almost simultaneously. This increases throughput significantly. To make best use of this, successive data items must be stored in different memory chips, as shown in Figure 2.5.


Figure 2.5:

Chip 1   Chip 2   Chip 3   Chip 4
0001     0002     0003     0004
0005     0006     0007     0008
0009     0010     0011     0012
0013     0014     0015     0016
0017     0018     0019     0020
0021     ...

(all four chips are connected to the data bus)

In this case, the processor could start accessing location 0001 in chip 1; while waiting for this to arrive, it would initiate access to location 0002, which is in chip 2, and so on. In practice, memory interleaving is not implemented on most personal computers, although it is possible using Pentium Pro chips and above. It is found mainly on systems which require fast data throughput, such as servers. It is also used on the latest Apple computers.

Memory interleaving is tricky to implement for memory fetches, as the processor has to deal with the data that arrives, which may require further processing steps. It is, therefore, more often used for memory writes, where the processor simply sends the data off to memory, and does not have to "worry" about what happens next. For a similar reason, memory interleaving is used to speed up access to video RAM. This is less problematic than main memory, as all the "data" is simply data, whereas in main memory, the "data" may be instructions!
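The interleaved layout in Figure 2.5 amounts to one simple rule: successive addresses go to successive chips, wrapping round after the last chip. A minimal Python sketch (illustrative only, with our own function name):

```python
def chip_for(address, num_chips=4):
    """Which chip holds a location when successive addresses are spread
    across the chips (addresses numbered from 0001, as in Figure 2.5)."""
    return ((address - 1) % num_chips) + 1

# Successive locations land on different chips, so the processor can start
# the next access while the previous chip is still completing its cycle.
print([chip_for(a) for a in range(1, 9)])   # [1, 2, 3, 4, 1, 2, 3, 4]
```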

2.4.6 Direct Memory Access (DMA)

The final technique we will consider is direct memory access (DMA), which is used when data is being transferred to or from a peripheral device. First, we need to understand how data is transferred without DMA. There are two methods commonly used - programmed I/O, and interrupt-driven I/O.

Consider a running program, which comes to an instruction which requires it to send a large block of data, currently held in main memory, to a backing storage device. Using programmed I/O, the processor fetches the first word from memory, stores it in a register, then sends it along the system bus to the interface or I/O module of the backing storage device, along with instructions to the I/O module. As the backing storage device is likely to be much slower than the processor, the processor must then wait until the device has completed the transfer. Once it has checked that the transfer is complete, it can fetch the next word from memory, and repeat the process. Clearly, this is wasteful


of processor time, especially if there is a large block of data to be transferred.

A second option is to use interrupt-driven I/O. Once again, the data is fetched from memory one word at a time, then sent to the I/O module for transfer to the device. Once it is sent, however, the processor continues with other instructions until the I/O module has completed the transfer. When this happens, the I/O module sends an interrupt signal to the processor. The processor suspends what it is doing, organises the next data transfer, and so on. This is clearly more efficient than programmed I/O, as the processor does not waste time waiting for the transfer to be completed. However, it does require the processor to carry out extra actions whenever it receives an interrupt.

The inefficiencies of programmed and interrupt-driven I/O are not too serious under most circumstances, but become a serious issue when large blocks of data are to be transferred between main memory and a slow peripheral. DMA is a technique which overcomes this. An additional device, called a DMA controller, is attached to the system bus. When the processor needs to make a block data transfer, it sends a command to the DMA controller. The DMA controller (DMAC) then takes over the data transfer. Data is transferred between main memory and the I/O module under DMAC control, without passing through the processor at all. This leaves the processor completely free to carry on with other tasks (so long as they do not require access to the system bus). This is particularly useful in multi-tasking or multi-user systems. Once the data block has been transferred, the DMAC sends an interrupt signal to the processor, informing it that the transfer has been completed. DMA is particularly useful for high speed disc transfers, and also for transferring data to the video I/O system for rapid display.
There are many different configurations of DMA, with subtle complexities that are well beyond the scope of this course, but all work by delegating the transfer of data away from the main processor, so that data can be routed directly between memory and I/O without passing through the processor registers.
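A toy Python model (ours, not from the course materials) makes the saving clear by counting how many times the processor itself must intervene in each scheme:

```python
def programmed_io(block_size):
    """Programmed I/O: the processor handles every word itself."""
    processor_actions = 0
    for _ in range(block_size):
        processor_actions += 1    # fetch the word and send it to the I/O module
        processor_actions += 1    # check (or wait) until the transfer completes
    return processor_actions

def dma_transfer(block_size):
    """DMA: the processor issues one command, the DMA controller (DMAC)
    moves the whole block, and a single interrupt reports completion."""
    processor_actions = 1         # command sent to the DMAC
    # ... the DMAC transfers block_size words directly between memory and I/O ...
    processor_actions += 1        # handle the completion interrupt
    return processor_actions

print(programmed_io(1024), dma_transfer(1024))   # 2048 vs 2
```

With DMA the processor's involvement is constant regardless of block size, which is exactly why it pays off for large transfers.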

2.5 Summary

• the main registers to be found in a processor are the MAR, MDR, program counter, instruction register and general purpose registers

• the MAR is attached to the address bus, and is used to hold the address of the memory location which is to be accessed

• the MDR is attached to the data bus, and holds data which is about to be sent, or has just arrived

• the processor continually repeats the fetch-execute cycle: fetch an instruction from memory, decode the instruction, increment the program counter, execute the instruction

• the PC always contains the address of the next instruction to be fetched

• PCI and PCI-X buses are used to communicate with slower peripherals, so that


these do not interfere with fast data transfer between processor and main memory along the frontside system bus.

• processor performance can be improved by

– increasing clock speed
– increasing data bus width
– use of cache memory
– memory interleaving
– direct memory access

2.6 Revision Questions

Q18: The MAR

a) holds the address of a memory location about to be accessed
b) holds data about to be sent to a memory location
c) holds an address which has just been received from memory
d) holds the instruction currently being executed

Q19: The correct sequence of steps in the fetch-execute cycle is

a) fetch instruction → increment PC → decode instruction → execute instruction
b) increment PC → fetch instruction → decode instruction → execute instruction
c) fetch instruction → decode instruction → increment PC → execute instruction
d) fetch instruction → decode instruction → execute instruction → increment PC

Q20: Which of the following is not a technique used to increase the speed of processing in a computer system?

a) increasing clock speed
b) widening the address bus
c) widening the data bus
d) use of cache memory

Q21: The data bus width of a Pentium or G5 chip is

a) 64 bits
b) 32 bits
c) 16 bits
d) 8 bits

Q22: The PCI bus

a) is an internal processor bus
b) connects the processor to L2 cache
c) is used to send data to and from peripheral devices
d) is the main bus connecting processor and memory


Q23: Which of the following statements is correct?

a) L2 cache is faster to access than L1 cache
b) L2 cache is on the processor
c) L1 cache is faster to access than L2 cache
d) L1 and L2 cache are both on the processor

Q24: Memory interleaving is a technique

a) which requires successive instructions to be stored on the same memory chip
b) which uses a DMA controller to by-pass the processor
c) which requires successive instructions to be stored on different memory chips
d) which can only be implemented using static RAM chips

Q25: The use of a separate processor which allows blocks of data to be transferred directly from memory to an I/O device is called

a) interrupt driven I/O
b) programmed I/O
c) direct memory access
d) interleaving



Topic 3

Processor Function

Contents

3.1 Prior Knowledge and Revision ...... 47
3.2 Introduction ...... 48
3.3 Machine Code and Assembly Language ...... 48
3.4 Assembly language instructions ...... 49
3.4.1 Op-code and operand ...... 49
3.4.2 Instruction Formats ...... 50
3.4.3 Instruction types ...... 51
3.5 The 6502 - a Case Study ...... 52
3.5.1 Introducing the 6502 microprocessor system ...... 52
3.5.2 6502 structure ...... 53
3.5.3 6502 assembler instruction - LDA #55 ...... 57
3.5.4 6502 addressing modes ...... 62
3.5.5 The 6502 instruction set ...... 64
3.5.6 Some simple 6502 machine code programs ...... 67
3.6 Running Assembly Language Programs ...... 79
3.6.1 Format of an assembly language program ...... 79
3.6.2 Working with an assembler ...... 80
3.6.3 Your first assembly language program ...... 81
3.6.4 3 more examples ...... 86
3.6.5 Exploring other machine code programs (optional) ...... 91
3.7 Summary ...... 92
3.8 Revision Questions ...... 92

Learning Objectives After studying this topic, you should be able to:

• describe the structure of typical assembly language instructions using the terms op-code and operand

• describe and give examples of assembly language instructions of the following types

– data transfer
– arithmetic
– logical
– shift and rotate
– branch


3.1 Prior Knowledge and Revision

Before studying this topic, you should be able to

• explain what is meant by machine code and assembly language

• describe the main registers to be found in a processor

• describe the relationship between these registers and the data and address buses

• describe the fetch-execute cycle in terms of buses and registers

Revision

Q1: Machine code is

a) a common programming language which can be used on any machine
b) the low level language specific to a particular microprocessor
c) the number which identifies a particular type of computer
d) any high level language which can be compiled for use by a microprocessor

Q2: The processor register which holds the address of the next instruction to be fetched is the

a) memory address register (MAR)
b) instruction register (IR)
c) program counter (PC)
d) memory data register (MDR)

Q3: The correct sequence of steps in the fetch-execute cycle is

a) fetch instruction → increment PC → decode instruction → execute instruction
b) increment PC → fetch instruction → decode instruction → execute instruction
c) fetch instruction → decode instruction → increment PC → execute instruction
d) fetch instruction → decode instruction → execute instruction → increment PC

Q4: Data fetched from memory arrives in

a) the memory address register (MAR)
b) the instruction register (IR)
c) the program counter (PC)
d) the memory data register (MDR)

Q5: In an 8-bit microprocessor,

a) the data and address buses must both be 8 bits wide
b) the data bus is 8 bits wide, but the address bus may be more than 8 bits wide
c) the address bus must be 8 bits wide, but the data bus can be any width
d) a single 8 bit wide system bus is used to transfer all data and addresses


3.2 Introduction

So far, we have studied the architecture of a computer system, and the internal architecture of a typical microprocessor. We have seen how processor, memory and input/output devices communicate with each other, and have considered techniques used to improve system performance. In this sub-topic, we will take an even closer look at the processor, the machine code instructions which it executes, and techniques which have been developed to improve processor performance.

3.3 Machine Code and Assembly Language

Processor instructions are known as machine code, and consist of binary digits. So, a machine code program might look like this:

10101001 00000001
10000101 01110000
10100101 01110000
01101001 00000001
10000101 01110001
00100000 11101110
11111111 01100000

This is, in fact, a program written in 6502 machine code to add 1 to a stored number. (6502 machine code is the machine code "understood" by the 6502 microprocessor, a simple microprocessor used in some early computers, such as the BBC micro, which was common in schools and homes in the 1980s.)

Figure 3.1: A BBC Microcomputer

The problem with machine code is that, although the processor can decode and execute it without any difficulty, it is almost impossible for humans to read and understand. So that machine code programmers don’t have to work directly with the binary codes, a type of language called assembly language has been devised. In assembly language,


each binary machine code instruction is replaced by a mnemonic - a short code, often of 3 letters.

machine code version           assembly language version
10101001 00000001              LDA #1
10001101 00000011 11101000     STA 1000
10101101 00000011 11101000     LDA 1000
01101001 00000001              ADC #1
10001101 00000011 11101001     STA 1001
00100000 11101110 11111111     JSR OSWRCH
01100000                       RTS

It is important to realise that assembly languages are simply symbolic representations of the underlying machine code. As we are humans, rather than processors, we will discuss processor operation in the next few topics using assembly language rather than machine code.

3.4 Assembly language instructions

3.4.1 Op-code and operand

Most machine code instructions consist of 2 parts, known as the op-code and the operand. The op-code is the first part, and defines what the instruction is to do - a move instruction, an add instruction, a jump instruction, and so on. The second part is the operand, which defines the data to be "operated" upon. So, for example, in the assembly language command LDA #1 (above), 10101001 (represented in assembler by LDA) is the op-code. It means load the accumulator (the main general purpose register in a 6502 processor) with some data. The operand in this case is 00000001, which is simply the number 1. The next command is 10001101 00000011 11101000 (represented by STA 1000). Here the op-code STA means "store the contents of the accumulator", and the operand 03E8 (hex) means "in memory location 1000".
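The op-code/operand split can be illustrated with a short Python sketch. This is our own illustration (not part of the course materials), using the op-code values quoted in this topic: it mimics the first job of an assembler, splitting each line into an op-code byte and its operand.

```python
# Op-code values quoted in this topic: A9 = LDA immediate, 8D = STA,
# AD = LDA (from memory), 69 = ADC immediate, 20 = JSR, 60 = RTS.
OPCODES = {"LDA #": 0xA9, "STA": 0x8D, "LDA": 0xAD,
           "ADC #": 0x69, "JSR": 0x20, "RTS": 0x60}

def split_instruction(line):
    """Split one assembly line into its op-code byte and operand text."""
    parts = line.split(None, 1)
    mnemonic = parts[0]
    operand = parts[1] if len(parts) > 1 else None
    # A '#' marks immediate data, which uses a different op-code byte.
    key = mnemonic + " #" if operand and operand.startswith("#") else mnemonic
    return OPCODES[key], operand

for line in ["LDA #1", "STA 1000", "LDA 1000", "ADC #1", "RTS"]:
    opcode, operand = split_instruction(line)
    print(f"{line:10s} op-code {opcode:02X}  operand {operand}")
```

Notice that the same mnemonic (LDA) maps to different op-code bytes depending on how its operand is written, a point taken up in the addressing-mode sections below.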

­c HERIOT-WATT UNIVERSITY 2006 50 TOPIC 3. PROCESSOR FUNCTION

3.4.2 Instruction Formats

Not all instructions are the same length. For example, in 6502 machine code, the op-code is always 8 bits long, while the operand may be 0 bits (as in RTS) or 8 bits (as in LDA #1) or 16 bits (as in STA 1000):

Note that an 8 bit op-code means that the processor’s instruction set may have up to 256 different instruction types. Some types of processor have instructions with multiple operands. For example, the following instruction formats (among others) exist in the IBM mainframe instruction set ...

... and instructions for the Intel X86 family of processors can be anything from 1 to 15 bytes long, like this:

[diagrams: example instruction formats, such as a 16 bit op-code followed by one 16 bit operand, or a 16 bit op-code followed by two 16 bit operands]


Despite all these variations, we can recognise the basic pattern - an op-code followed by a number of operands. The op-code defines the action. The operand(s) contain data or addresses of data.

3.4.3 Instruction types

Machine code or assembler instructions can be grouped into a number of categories:

Data transfer instructions - these are used to move data from one place to another. This could be moving data between registers within a processor, or fetching data from memory, or sending data to memory.

Arithmetic instructions - used to carry out simple arithmetic, such as adding the contents of two registers, or adding 1 to (incrementing) the content of a register.

Logical instructions - used to make logical comparisons, for example checking whether the contents of two registers are equal to each other.

Shift and rotate instructions - these instructions manipulate the individual bits within a register. For example, if the "shift left 1 bit" instruction is applied to a register holding the data 00110011, the result will be 01100110. Each bit is shifted 1 bit to the left:

0 0 1 1 0 0 1 1

0 1 1 0 0 1 1 0

Note that the left hand bit is lost, and a 0 has to be added in as the new right hand bit. This may not seem a very useful instruction, but notice the value of the data before and after the shift operation.

00110011 = 32 + 16 + 2 + 1 = 51

01100110 = 64 + 32 + 4 + 2 = 102

Shifting the bits 1 place to the left is the equivalent of multiplying by 2.

Q6: What is the effect of shifting the bits 1 place to the right?

A rotate instruction works in a similar way, but the end bit (which is lost in a shift operation) re-appears at the other end:

0 0 1 1 0 0 1 1

0 1 1 0 0 1 1 0
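The shift and rotate behaviour above can be checked with a short Python sketch (ours, for illustration), masking with 0xFF to mimic an 8-bit register:

```python
def shift_left(value):
    """8-bit shift left: the leftmost bit is lost, a 0 enters on the right."""
    return (value << 1) & 0xFF

def rotate_left(value):
    """8-bit rotate left: the bit that falls off re-enters on the right."""
    return ((value << 1) | (value >> 7)) & 0xFF

print(shift_left(0b00110011))     # 102: shifting left multiplies 51 by 2
print(rotate_left(0b00110011))    # also 102 here, since the lost bit was a 0
print(shift_left(0b10000001))     # 2: the leftmost 1 is lost
print(rotate_left(0b10000001))    # 3: the leftmost 1 reappears on the right
```

The last two lines show where shift and rotate differ: they only disagree when the bit pushed out is a 1.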


Branch instructions - these are the equivalent of an IF statement in a high level programming language. They allow decisions to be made in a machine code program, depending on the values held in various registers. To understand these better, we will look at some examples from a simple microprocessor, the 6502.

3.5 The 6502 - a Case Study

The 6502 processor was the basis for the Acorn BBC range of microcomputers, which were used in schools and homes in the early 1980s. It was also used in a number of other microcomputers, including the Apple II, the Commodore Pet and the Atari. The 6502 is still used as an embedded microprocessor.

Why study the 6502 processor, rather than a modern processor such as the Pentium Pro? The main reason for choosing the 6502 is its simplicity. By choosing to study a simple processor like the 6502, you will be able to understand the basic principles of its operation. Modern processors are still based on these basic principles, so what you learn about the 6502 will help you understand the much more complicated processors currently being used and developed.

If you want to find out more detailed information about the 6502 microprocessor, the following web sites are useful:
http://www.obelisk.demon.co.uk/6502/
http://www.6502.org

3.5.1 Introducing the 6502 microprocessor system

Figure 3.2 is a schematic diagram of a typical microcomputer system built around the 6502 microprocessor.


[Figure 3.2: The 6502 microprocessor system - diagram showing the processor, crystal clock (XTAL), ROM, RAM and PIO linked by an 8 bit data bus, a 16 bit address bus and control lines, with I/O buses connecting the PIO to the I/O devices]

The 6502 microprocessor (introduced in about 1975) was among the first microprocessors to be used in early home computers. The 6502 processor included the usual Arithmetic/Logic Unit with some internal registers and a Control Unit, all on the same chip. It had an external crystal-controlled clock to generate timing signals. Read-Only Memory (ROM) was used to hold a bootstrap program to permit initial operation of the system. Random Access Memory (RAM) was used to hold programs and data. The interface with the external devices was via a Programmable Input/Output unit (PIO), which communicated with the external devices using 16-bit wide I/O buses. The external bus was a combination of an 8-bit-wide data bus, a 16-bit-wide address bus and some control lines that carried synchronisation signals throughout the system. Hence only 8-bit data could be moved around the system, but 16-bit addresses could be used to address memory. Memory locations with addresses in the range 0 to 65535 (2^16 - 1) could be directly addressed. Memory was made up of individually addressable 8-bit words (bytes).
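The combination of an 8-bit data bus and a 16-bit address bus means that a full address has to be handled as two separate bytes whenever it travels over the data path. A minimal Python sketch (ours, for illustration):

```python
def split_address(address):
    """Split a 16-bit address into its high and low bytes, so it can be
    handled one byte at a time over an 8-bit data path."""
    return address >> 8, address & 0xFF

high, low = split_address(0x2001)
print(hex(high), hex(low))            # 0x20 0x1
print((high << 8) | low == 0x2001)    # True: the two bytes rebuild the address
print(2 ** 16 - 1)                    # 65535, the highest 16-bit address
```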

3.5.2 6502 structure

The 6502 processor is an 8-bit processor. That means most of its registers can hold 8 bits of data, and it is connected to main memory by an 8-bit wide data bus. However, some of its registers can hold 16 bits. This is to allow them to hold addresses, and to communicate with memory along a 16-bit address bus. Here is a schematic diagram of the 6502 processor:


There are six 8-bit registers, designed to hold data:

A is known as the accumulator. It is the main general purpose register, in which data is held during most arithmetic and logical operations.

X and Y are index registers. They are designed to hold loop counters, especially for use in addressing successive memory locations. However, they can also be used as general purpose registers.

SR is the status register. Although shown above as a single 8-bit register, it is really a set of 8 1-bit registers. Each bit operates independently, and is used as a flag to record the status of the processor after each instruction. From left to right, the flags are:

• N - the negative flag - set to 1 if the result of the last operation was a negative number (assuming two’s complement, i.e. the leftmost bit was 1)

• V - the overflow flag - set to 1 if the result of an operation is an invalid two’s complement result

• (the next bit is not used)

• B - the break bit - set to 1 when a break instruction is executed

• D - the decimal flag - when set to 1 (by an SED instruction), the processor will operate using binary coded decimal arithmetic

• I - the interrupt disable flag - when set to 1 (by an SEI instruction), the processor will ignore any interrupt signals arriving on the interrupt line of the control bus

• Z - the zero flag - set to 1 if the result of the last operation was zero

• C - the carry flag - set to 1 if the result of an operation is a number that cannot be stored in a single 8-bit register

So, the status register is better represented as:

NVBDIZC
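The flag behaviour can be modelled with a short Python sketch of an 8-bit addition. This is a simplification (our own, not from the course materials - a real 6502 ADC also involves the carry-in and the decimal mode), but it shows how N, V, Z and C follow from the result:

```python
def add_8bit(a, b):
    """Add two 8-bit values and report the N, V, Z and C flags, following
    the flag definitions above (N and V assume two's complement)."""
    total = a + b
    result = total & 0xFF                      # keep only 8 bits
    flags = {
        "N": (result & 0x80) != 0,             # leftmost bit is 1
        "Z": result == 0,                      # result is zero
        "C": total > 0xFF,                     # carry out of the 8-bit register
        # V: both operands have the same sign, but the result's sign differs
        "V": ((a ^ result) & (b ^ result) & 0x80) != 0,
    }
    return result, flags

# 11111111 + 00000011 does not fit in 8 bits, so the carry flag is set.
result, flags = add_8bit(0b11111111, 0b00000011)
print(bin(result), flags)
```

Trying the calculations from the review questions below against this sketch is a useful check on your answers.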

The other two 8-bit registers are the IR (instruction register) and the MDR (memory data register). As you already know, the IR holds the instruction


currently being decoded or executed, and the MDR holds data which has just arrived along the data bus or is about to be sent along the data bus.

There are three 16-bit registers, designed to hold addresses:

The MAR is the memory address register, which holds an address about to be sent out along the address bus.

The PC is the Program Counter, which holds the address of the next instruction to be fetched, decoded and executed.

The SP is the stack pointer. The 6502 processor has a 256 byte stack, where data can be stored temporarily. The stack pointer holds the address of the next free location in the stack.

Matching 6502 registers

Q7: Complete the table below by filling in the Register Name, Abbreviation and Bits columns.

Register Name   Abbreviation   Bits   Purpose
Accumulator     A              8      General purpose register - holds result of any calculation
?               ?              ?      General purpose registers, used for loop counters
?               ?              ?      Holds data arriving from data bus, or about to be sent on data bus
?               ?              ?      Holds address to be sent out on address bus
?               ?              ?      Holds address of next instruction to be fetched
?               ?              ?      Holds instruction currently being decoded or executed
?               ?              ?      Holds 7 individual flag bits
?               ?              ?      Holds address of first empty location in the 256 byte stack

Review Questions - the status register

Here is the status register, as it might appear after certain operations in the 6502 processor:

   N V B D I Z C
a) 1 0 0 0 0 0 0
b) 0 0 0 0 0 1 0
c) 0 0 0 0 0 0 1
d) 0 0 0 1 0 0 0


Which state will the status register show after:

Q8: the calculation 00001000 - 00011111

a) a b) b c) c d) d

Q9: the calculation 11111111 + 00000011

a) a b) b c) c d) d

Q10: the instruction SEI

a) a b) b c) c d) d

Q11: the calculation 00001000 - 00001000

a) a b) b c) c d) d

3.5.3 6502 assembler instruction - LDA #55

Here is a diagram of a 6502 processor, connected to a section of memory. The addresses and data in memory and registers are shown in hexadecimal format (but remember that this is simply a convenient way of representing binary numbers):


[diagram: the processor with PC = 2000, all other registers (SP, X, Y, A, MAR, MDR, IR) and status flags at 0, alongside this section of memory:]

Address   Data
2000      A9
2001      55
2002      00
2003      14
2004      B3
2005      98
2006      00
2007      6D
2008      17
2009      28
200A      D5
200B      1F
200C      33
200D      AD
200E      00
200F      00
2010      00
2011      EB
2012      97

The current value in the PC is 2000. This means that the next instruction to be fetched is to be found in memory location 2000. The number in location 2000 is A9 (shown in hex, but stored in binary). This is the op-code LDA, which means load the accumulator with some data. The following sequence of steps shows in detail what happens next:


Step 1 The value 2000 in the PC is moved to the MAR.

[registers: PC = 2000, MAR = 2000]

Step 2 The control unit sends a read signal along the read line in the control bus to memory, which releases the data A9 from memory location 2000. A9 arrives in the MDR.

[registers: PC = 2000, MAR = 2000, MDR = A9; READ signal on the control bus, location 2000 releases A9]

Step 3 The PC is incremented by 1.

[registers: PC = 2001, MAR = 2000, MDR = A9]


Step 4 A9 is identified as an instruction, so it is moved to the IR to be decoded.

[registers: PC = 2001, MAR = 2000, MDR = A9, IR = A9]

Step 5 A9 is decoded by the control unit. A9 is a load accumulator with data instruction. The control unit begins execution of this instruction.

[registers: PC = 2001, MAR = 2000, MDR = A9, IR = A9]

Step 6 The value 2001 in the PC is transferred to the MAR.

[registers: PC = 2001, MAR = 2001, MDR = A9, IR = A9]


Step 7 The control unit sends another read signal to memory, which releases the value 55 from location 2001. 55 arrives in the MDR.

[registers: PC = 2001, MAR = 2001, MDR = 55, IR = A9; READ signal on the control bus, location 2001 releases 55]

Step 8 The PC is incremented by 1.

[Diagram: PC = 2002; MAR = 2001; MDR = 55; IR = A9]

Step 9 The control unit transfers the data from the MDR to the accumulator as required by the instruction A9.

[Diagram: PC = 2002; A = 55; MAR = 2001; MDR = 55; IR = A9]
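The nine steps above are one complete pass through the fetch-decode-execute cycle. As a way of summarising them, here is a small Python sketch (an illustrative model written for this guide, not part of the SCHOLAR emulator) that traces LDA #55 through the same registers:

```python
# Toy model of the 6502 fetch-decode-execute sequence for LDA #55.
# Values are plain integers; hexadecimal is only a display convention.
memory = {0x2000: 0xA9, 0x2001: 0x55}   # the two bytes of LDA #55

pc, a = 0x2000, 0x00                    # program counter, accumulator

# Fetch the op-code (steps 1-4).
mar = pc                                # step 1: PC -> MAR
mdr = memory[mar]                       # step 2: read signal, data -> MDR
pc += 1                                 # step 3: PC is incremented
ir = mdr                                # step 4: MDR -> IR for decoding

# Decode and execute (steps 5-9).
if ir == 0xA9:                          # step 5: A9 means "load A, immediate"
    mar = pc                            # step 6: PC -> MAR
    mdr = memory[mar]                   # step 7: read the operand
    pc += 1                             # step 8: PC is incremented
    a = mdr                             # step 9: MDR -> accumulator

print(hex(a), hex(pc))
```

At the end of the run the accumulator holds 55 and the PC points at location 2002, exactly as in the final diagram.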


3.5.4 6502 addressing modes 3.5.4.1 Immediate addressing

In the previous example, the instruction was LDA #55. In this case, the op-code is LDA, and the operand is the number 55. The operand is the actual data that is to be used. It follows immediately after the op-code. This is an example of immediate addressing. In immediate addressing, the format of the instruction in 6502 machine code is the op-code followed immediately by the data itself (Figure 3.3).

Here are some other examples of immediate addressing:

CMP #27, which means "compare the contents of the accumulator with the number 27".
LDX #A4, which means "load the X register with the number A4".
LDY #00, which means "load the Y register with the number 00".

3.5.4.2 Direct addressing

LDA can be used in another way. An example would be LDA 3C15. This means "load the accumulator with the data which is stored in memory location 3C15". This is called direct addressing. In direct addressing, the format of the instruction in 6502 machine code is the op-code followed by the address of the data (Figure 3.4).

Here are some other examples of direct addressing:

JMP 2015, which means "jump to the instruction in location 2015".
ADC 2099, which means "add the data in location 2099 to the accumulator".

3.5.4.3 Implied addressing

Another form of addressing is known as implied addressing. In implied addressing there is no operand at all, so the format of the instruction is simply the op-code on its own.

Some examples of this are:

TAX, which means "transfer the contents of the accumulator to the X register".
TYA, which means "transfer the contents of the Y register to the accumulator".
DEX, which means "decrement (reduce by 1) the value in the X register".
INY, which means "increment (increase by 1) the value of the Y register".
TXS, which means "transfer the value in the X register to the stack pointer".

3.5.4.4 Other addressing modes

As well as immediate, direct and implied addressing, the 6502 processor has many other addressing modes. These include:

• relative addressing - for example BEQ 09, which means "if the zero flag was set by the last operation, branch to the instruction 9 locations forward in memory"

• indexed addressing - for example LDA 3C15,Y which means "load the accumulator with the data to be found at the location 3C15+Y, where Y is the value stored in the Y register"

• indirect addressing - for example LDA (3C15), which means "load the accumulator with the data you find at the address which is to be found at location 3C15"
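The essential difference between these modes is how the operand is interpreted to find the data. The following Python sketch (a hypothetical model with made-up memory contents, written for illustration only) shows the direct, indexed and indirect lookups side by side:

```python
# Toy memory: each address maps to one stored value, purely for illustration.
memory = {0x3C15: 0x44, 0x3C19: 0x99, 0x0044: 0x77}
y = 0x04                                  # value held in the Y register

def lda_direct(addr):                     # LDA 3C15
    return memory[addr]                   # data is at the address itself

def lda_indexed(addr, index):             # LDA 3C15,Y
    return memory[addr + index]           # data is at address + Y

def lda_indirect(addr):                   # LDA (3C15)
    return memory[memory[addr]]           # the location holds another address

print(hex(lda_direct(0x3C15)))            # the data at 3C15
print(hex(lda_indexed(0x3C15, y)))        # the data at 3C15+Y
print(hex(lda_indirect(0x3C15)))          # the data at the address found at 3C15
```

With the memory contents assumed above, the three loads return 44, 99 and 77 respectively: same operand, three different pieces of data.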

Figure 3.3: Immediate Addressing

Figure 3.4: Direct Addressing

Figure 3.5: Indirect Addressing

Figure 3.6: Indexed Addressing

3.5.5 The 6502 instruction set

You have seen quite a few different 6502 instructions so far. The following sub-topics list the complete instruction set, categorised into groups, with illustrations of some of them. The diagrams show a "programmer's view" of the processor - the MDR, MAR, IR, PC and SP have been omitted, so that you can see clearly the effects of the instructions on the registers which can be manipulated directly by the machine code programmer. These are the accumulator, X and Y registers, and the status register. A small block of main memory is shown alongside the processor.

If you counted all the different instructions, you would find that the total number is 56. However, some of these can be used with different addressing modes, so the real total number of different instructions in the 6502 instruction set is around 100. This is easily implemented using 8-bit op-codes, as 8 bits allows 256 different possible codes, more than enough to represent the 100 or so instructions in the set.

3.5.5.1 Data transfer instructions

The following table lists all data transfer instructions in 6502 machine code.


mnemonic  meaning                             flags set
LDA       load accumulator                    N, Z
LDX       load X register                     N, Z
LDY       load Y register                     N, Z
STA       store accumulator
STX       store X register
STY       store Y register
TAX       transfer accumulator to X register  N, Z
TAY       transfer accumulator to Y register  N, Z
TXA       transfer X register to accumulator  N, Z
TYA       transfer Y register to accumulator  N, Z
TSX       transfer stack pointer to X         N, Z
TXS       transfer X to stack pointer
PHA       push accumulator on to stack
PHP       push status register on to stack
PLA       pull accumulator from stack         N, Z
PLP       pull status register from stack     all

You should now observe the online animation of a 6502 data transfer instruction.

3.5.5.2 Arithmetic instructions

The following table lists all arithmetic instructions in 6502 machine code.

mnemonic  meaning                      flags set
ADC       add to accumulator           N, V, Z, C
SBC       subtract from accumulator    N, V, Z, C
INC       increment a memory location  N, Z
INX       increment X register         N, Z
INY       increment Y register         N, Z
DEC       decrement a memory location  N, Z
DEX       decrement X register         N, Z
DEY       decrement Y register         N, Z

You should now observe the online animation of a 6502 arithmetic instruction.


3.5.5.3 Logical instructions

The following table lists all logical instructions in 6502 machine code.

mnemonic  meaning                                   flags set
CMP       compare with accumulator                  N, Z, C
CPX       compare with X register                   N, Z, C
CPY       compare with Y register                   N, Z, C
AND       logical AND with accumulator              N, Z
EOR       exclusive OR with accumulator             N, Z
ORA       inclusive OR with accumulator             N, Z
BIT       logical AND but does not keep the result  N, Z, V

You should now observe the online animation of a 6502 logical instruction.

3.5.5.4 Shift and rotate instructions

The following table lists all shift and rotate instructions in 6502 machine code.

mnemonic  meaning                flags set
ASL       arithmetic shift left  N, Z, C
LSR       logical shift right    N, Z, C
ROL       rotate 1 bit left      N, Z, C
ROR       rotate 1 bit right     N, Z, C

You should now observe the online animation of a 6502 shift and rotate instruction.

3.5.5.5 Branch instructions

The following table lists all branch instructions in 6502 machine code.

mnemonic  meaning
BCC       branch if carry flag clear
BCS       branch if carry flag set
BEQ       branch if zero flag set
BMI       branch if negative flag set
BNE       branch if zero flag clear
BPL       branch if negative flag clear
BVC       branch if overflow flag clear
BVS       branch if overflow flag set

(None of the branch instructions affects the flags.)


These important instructions allow the programmer to insert conditional branches, depending on various flags which are clear (0) or set (1) in the status register. For example, a previous instruction could compare a value in memory with the value in the accumulator, using a CMP instruction. The processor does this by subtracting the value in memory from the accumulator. If they are equal, the result will be zero, so the Z flag will be set to 1. If the next command were (for example) BEQ 08, then the PC would be increased by 08. If the zero flag was not set, then the program would simply continue to the next instruction.
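This CMP/BEQ interplay can be sketched in a few lines of Python (an illustrative model for this guide, not the 6502's internal logic): CMP performs the subtraction and records the Z flag, and BEQ adds the offset to the PC only when Z is set.

```python
# Toy model of CMP followed by BEQ 08 (not the emulator's actual code).
def cmp_sets_z(acc, value):
    # CMP subtracts the value from the accumulator, discards the result,
    # and reports whether the Z flag would be set (i.e. the values match).
    return ((acc - value) & 0xFF) == 0

pc = 0x2000                  # assumed address of the BEQ instruction
z = cmp_sets_z(0x37, 0x37)   # equal values, so Z is set
pc += 2                      # PC moves past the 2-byte BEQ before branching
if z:                        # BEQ: branch only when Z = 1
    pc += 0x08               # take the branch, 8 locations forward
print(hex(pc))
```

With unequal values, the `if` would not fire and execution would simply continue at the next instruction.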

3.5.5.6 Other instructions

The 6502 contains a number of other instructions which do not fit any of the categories above:

mnemonic  meaning                       flags set
JMP       jump to another location
JSR       jump to a subroutine
RTS       return from subroutine
CLC       clear carry flag              C
CLD       clear decimal mode flag       D
CLI       clear interrupt disable flag  I
CLV       clear overflow flag           V
SEC       set carry flag                C
SED       set decimal mode flag         D
SEI       set interrupt disable flag    I
BRK       force an interrupt            B
NOP       no operation
RTI       return from interrupt         all

3.5.6 Some simple 6502 machine code programs In this sub-topic, we will look at 3 simple machine code programs. You can follow them step by step in the animated diagrams.

3.5.6.1 Adding two numbers

Program specification: Add the numbers stored in memory locations 2004 and 2005, and store the result in memory location 2006.


Assembly language program:

CLC       (clear the carry flag, just in case)
LDA 2004  (fetch the number from location 2004)
ADC 2005  (add on the number from location 2005)
STA 2006  (store the result in location 2006)

At start of program:

[Diagram: registers X = 00, Y = 00, A = 00; flags N V B D I Z C = 0 0 0 0 0 0 1; memory: 2000 = A9, 2001 = 55, 2002 = 00, 2003 = 14, 2004 = 25, 2005 = 36, 2006 = 00, 2007 = 6D, 2008 = 17]


There is an animated version of these diagrams online.


3.5.6.2 Multiplying by 3

Program specification: Multiply the number stored in memory location 2004 by 3, and store the result in memory location 2005.

Assembly language program:

CLC       (clear the carry flag, just in case)
LDA 2004  (fetch the number from location 2004)
ADC 2004  (add the number to itself)
ADC 2004  (add the number on again)
STA 2005  (store the result in location 2005)

At start of program:

[Diagram: registers X = 00, Y = B3, A = 00; flags N V B D I Z C = 0 0 0 0 0 0 1; memory: 2000 = A9, 2001 = 55, 2002 = 00, 2003 = 14, 2004 = 25, 2005 = 36, 2006 = 00, 2007 = 6D, 2008 = 17]


There is an animated version of these diagrams online.

3.5.6.3 Multiplying 2 numbers

Program specification: Multiply the numbers stored in memory locations 2004 and 2005, and store the result in memory location 2006.

Method: We will use successive addition. This will require a loop. We will use the X register to keep count of how many more additions are required.

Assembly language program:

LDA 2004  (fetch the number from location 2004)
LDX 2005  (load the X register with the number from location 2005)
DEX       (reduce X by 1 (see note 1 below))
CLC       (clear the carry flag, just in case)
ADC 2004  (add the number from location 2004 to the current value in the accumulator)
DEX       (reduce the X register by 1)
BNE -6    (if X is not zero, branch back 6 locations (see note 2 below))
STA 2006  (store the result in location 2006)

Notes:

1. If, for example, we are multiplying by 8, we need to add the number on to itself 7 times.

2. If X is zero, there are no more additions required, so the program continues to the STA instruction. If X is not zero, then the program must jump back to the ADC 2004 instruction. This is 6 bytes back in the program, since the length of ADC 2004 is 3 bytes, DEX is 1 byte, and BNE -6 is 2 bytes.
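The successive-addition loop can be mirrored in Python (an illustrative sketch of the algorithm written for this guide, not 6502 code). Like the assembly version above, it assumes the multiplier is at least 2:

```python
# Python mirror of the successive-addition multiply routine.
# Assumes times >= 2, just as the 6502 version does.
def multiply_by_successive_addition(num, times):
    a = num              # LDA 2004 - the load supplies the first copy
    x = times            # LDX 2005 - X counts the additions still needed
    x -= 1               # DEX - one copy is already in the accumulator
    while True:
        a += num         # ADC 2004
        x -= 1           # DEX
        if x == 0:       # BNE -6 falls through only when X reaches zero
            break
    return a             # STA 2006

print(multiply_by_successive_addition(0x25, 0x05))
```

Each pass around the `while` loop corresponds to one trip back through the ADC/DEX/BNE sequence.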

At start of program:

[Diagram: registers X = 00, Y = B3, A = 00; flags N V B D I Z C = 0 0 0 0 0 0 1; memory: 2000 = A9, 2001 = 55, 2002 = 00, 2003 = 14, 2004 = 25, 2005 = 05, 2006 = 00, 2007 = 6D, 2008 = 17]


3.6 Running Assembly Language Programs For the practical outcome of the unit, you are required to implement simple assembly language instructions, using an assembler or an emulator. This sub-topic will enable you to try out some 6502 assembly language programs using a 6502 emulator.

3.6.1 Format of an assembly language program

An assembler program will usually consist of a number of lines of program code, with each line containing one or more of the following items:

1. a mnemonic;
2. operands (data operated on);
3. labels (destinations for jumps);
4. comments (additional information added by the programmer to clarify the purpose of programs to human readers);
5. white space (to improve readability of the program).

The majority of the activity performed by a processor is just moving values (bit patterns) between memory and registers. Remember that registers are just a specialised form of memory that is part of the processor itself.

A typical assembly program (not for any specific machine) might look like this:

:START          ;Entry point of the program.
 LDA (6)        ;Load the value at location 6 into
                ;the accumulator.
 ADD (5)        ;Add the contents of location 5 to
                ;the current contents of the accumulator.
 STO (7)        ;Store the result of the computation
                ;in location 7.
 CMP (3)        ;Compare the result with the contents
                ;of location 3.
 JGT FINISH     ;If the contents of the accumulator
                ;are greater than the contents of
                ;location 3 then jump to the label
                ;FINISH.
 STO (3)        ;Store the accumulator value into
                ;location 3.
:FINISH         ;End point of the program.


The program itself starts with the label :START. In this language, labels are introduced by a colon. Comments are introduced by semicolons and extend to the end of a line.

The next line contains a mnemonic, LDA, which in this case stands for load, followed by an operand, in this case (6), which indicates where the data is to be loaded from. The rest of the program introduces a few more mnemonics: STO, for store, CMP, for compare, and JGT, for jump if greater than.

3.6.2 Working with an assembler

An assembler is a program which converts a textual representation of a machine language program into a representation suitable for loading into the machine. This textual representation is known as an assembly language. Each machine instruction has an easily remembered mnemonic. Different addressing modes can be specified, either directly, or by implication from the format of the operands given in the assembly language program. Using the dasm assembler for the 6502, the program to add two numbers that was given above would be expressed as:

 LDA $70
 ADC $71
 STA $74

The $ symbol introduces a hexadecimal value. LDA is a mnemonic which the assembler program recognises as the code for the 'Load Accumulator' instruction. Likewise ADC stands for 'Add with Carry' and STA for 'Store Accumulator'. The assembler takes care of translating the mnemonics into the correct code for the corresponding machine instruction, recognising the appropriate addressing mode, and hence selecting the correct hexadecimal representation of the instruction, given the addressing mode.

The assembler maintains a location counter, which is the address at which the next instruction or data item will be stored. As each instruction is encountered, the location counter is incremented by the appropriate amount.

In the previous example, explicit memory addresses were specified for the data items. Although explicit numerical addresses might be required if it is essential that particular data items are stored at particular addresses, it is much more common for arbitrary addresses to be suitable. Most assemblers allow locations in memory to be given symbolic names, so that the programmer does not need to remember that location 74 was where the total was being stored, but can write:

 LDA OLD
 ADC EXTRA
 STA TOTAL

Elsewhere in the program, OLD, EXTRA and TOTAL are associated with actual locations. The programmer may choose to associate them with a particular location, by specifying the value for each symbol, or the assembler may assign addresses for the symbols, based on how big the user's program is. If the addresses have been assigned by the assembler, then they will automatically change if more instructions are added to the program, or if extra data are added to the program.

dasm, in common with most assemblers, expects the assembly language program to be prepared with at most one statement on a line. The general format is as follows:

label operation operand(s) ; comment

The four fields of the line are separated by one or more spaces. Only the operation field is compulsory. Depending on the operation or directive, the other fields may not be required. The label field must start in the first column. It is only required to specify a symbolic name for the location. This may be used as the target for a jump instruction if needed, or as the name for a data item. If the first character on the line is a space, then it is assumed that there is no label field. The operation field can be either a mnemonic representing a machine operation, such as LDA, or a directive controlling the operation of the assembler program itself in some way, such as DC, which defines a constant value to be placed in a storage location. The format of the operand field will depend on the operation being performed, and, if it is a machine operation, on the valid addressing modes for that operation. Finally, the comment field is preceded by a ';' character. The comment is optional, of course, and in general it should be used to explain the purpose of the statement, not merely to say in words what the statement does. In general, each line of an assembly language program will have at least either an operation or a comment. The label is optional, and the presence or absence of operands will depend on the operation.

3.6.3 Your first assembly language program You are going to download an assembly language program, run it through an assembler to convert it into machine code, then run it on a 6502 processor emulator.

The program is designed to calculate 3 × n + 1. However, this is not important, as its main purpose is to introduce you to the on-line assembler and the 6502 emulator.

Step 1: download the file FIRST.ASM below

FIRST.ASM

        ORG 0
        CLC             ; clear carry bit
        LDA #0          ; load immediate 0
        ADC N           ; add the contents of location N
        ASL             ; left shift accumulator
        ADC N
        ADC #1
        STA RESULT      ; save the result
        NOP
        BRK
        ORG .+20
N       DC 35           ; initial value 35
RESULT  DS 1            ; reserve space

This file (FIRST.ASM) can be downloaded from the course web site.

Code 3.1
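To see why this file computes 3 × n + 1, trace the accumulator through the instructions. The arithmetic can be checked with a short Python function (an illustrative trace written for this guide, not emulator code):

```python
def first_asm(n):
    # Accumulator trace of FIRST.ASM for a value n stored at label N.
    a = 0          # LDA #0
    a += n         # ADC N   -> n
    a <<= 1        # ASL     -> 2n (shifting left doubles the value)
    a += n         # ADC N   -> 3n
    a += 1         # ADC #1  -> 3n + 1
    return a

print(first_asm(35))
```

With the value 35 stored at N, the result is 106 ($6a), which is what the emulator run described below stores at RESULT.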


Step 2: check its contents by opening it with a text editor (Notepad, SimpleText or similar); the file should be similar to Figure 3.7.

Figure 3.7

Step 3: use the on-line assembler to convert this into machine code (http://www.macs.hw.ac.uk/~pjbk/scholar/download2005.html). You should see something similar to Figure 3.8.

Figure 3.8

At the bottom of the window, you will see the 6502 emulator.

Step 4: run the machine code program in the 6502 emulator.

You will note that the 6502 emulator has two areas on the right of the screen where memory is displayed. The upper area shows a hexadecimal representation of the memory. Each row represents 8 bytes. The address of the first byte in the row is shown


on the left, with the contents of the address on the right. You can scroll up and down this area to display different addresses. The lower area shows the contents of the memory in ASCII. Plain ASCII characters are represented by a space followed by their value. Control characters (those with codes between 0 and $20) are represented by the caret character ^ followed by the appropriate alphabetic character. When the character has its top bit set, the prefix character is modified to show this.

The registers on the 6502 chip are displayed in the centre of the window. At the top is the clock. It records the number of memory cycles that have elapsed as the program runs. To the right of the clock is a register showing the speed of the emulator. When the

speed is 100, the emulator executes 1 clock cycle per second. The accumulator has the X and Y index registers immediately to its right. The status register is displayed both as a register in hexadecimal, and with each of the status bits individually represented. SP is the stack pointer, whose purpose will be explained later. It is always initialised to $ff.

The PC register is the program counter, storing the location of the next instruction to execute. The other registers displayed cannot be affected directly by the programmer. Their contents change during the execution of programs. The IR register contains the instruction being executed. An assembly language version of that instruction is displayed above the register. The MAR is the address at which the next memory access will be made, and the MDR contains the data which is to be written or has been read from memory. Any of these registers can be modified by clicking on the display and editing the value. The register will only take on its new value when the Return key is pressed. Only hexadecimal values can be entered.

Below the registers is a text area in which messages may appear, for example if an undefined operation is executed, or when the BRK operation is executed, stopping the program. If tracing is selected, the output appears here.


Figure 3.9: 6502 Emulator

The top of the window displays two rows of buttons.

• Clear Registers. Resets all registers to their initial values. That is, all registers are set to 0, except for the stack pointer which is set to $ff. Memory is unaffected.

• Zero Memory. Memory is cleared to all 0 bytes. Register values are unaffected.

• Load. Allows the user to select a file and load the data from that file. When running as an application, the user can select any file. If the emulator is running as an applet on a web page, then a selected list of programs will appear from which the user can select one.

• Step. Executes a single instruction at the address specified in the program counter.

• Run. Executes instructions starting at the address in the program counter, continuing until a BRK instruction is reached.

• Trace. When this button has been pressed, a listing of all changes to all the registers is written to the message area. It may be helpful in debugging programs. The time from the clock register is printed before each message.

• Quit. Exits the program.

The left of the screen shows a slider that controls the speed of the emulator.


• Load the program first.bin using the Load button. You will observe that the memory has been updated. Locations 0 to $0d now contain the program (because the BRK operation has the value $00, only locations $00 to $0c appear to be modified). Locations $22 and $23 contain the data. Scrolling down, you will observe that location $22 has the value $23.

• Notice that FIRST.ASM specified ORG 0 for its first instruction. This means that the default value of the PC register is suitable for starting the program running. Unless you have a particular purpose in mind, you should always start your programs with ORG 0. If you wish your program to start execution at some location other than 0, then you must edit the PC register before pressing the Step or Run buttons. Replace the 0 in the PC register with the hexadecimal address at which the program's first instruction is located. After you have modified PC, be sure to press the Return or Enter key, or the change will be ignored.

• You should now step through the program, one instruction at a time, using the Step button. After each instruction has executed, examine the memory and registers to see the changes that have occurred.

1. The first instruction does not affect the memory, and does not appear to affect the accumulator or index registers. $18 is the instruction CLC, which clears the carry flag. Initially it is 0 anyway, so no effect is seen. The PC register is advanced by 1, since the instruction is only a single byte long. The clock at the top of the window shows that the instruction took 3 memory cycles to operate.

2. The second operation also shows no effects. In fact, $a9 is the instruction for LDA #0, that is, load the accumulator, using an immediate operand, in this case 0. Since the accumulator is 0 initially, it does not change. The program counter advanced by 2, since the instruction is two bytes long. In the status register, the Zero flag is set, since the result is zero. The instruction register contains the machine instruction $a9. The clock at the top of the window shows that the instruction took 2 memory cycles to operate.

3. The contents of memory location $22 are added to the accumulator. The Zero flag is cleared because the result in the accumulator is non-zero. This operation required 3 memory cycles.

4. The accumulator is shifted left by one bit position, effectively doubling its value.

5. The contents of location $22 are added to the accumulator again, giving a value of $69.

6. An immediate operand of 1 is added to the accumulator.

7. The resulting value of $6a is stored at location $23. You can observe that this has the value j in the ASCII view of memory.

8. A NOP is a no-operation; it has no effect on any memory or register value. Its main use in our example programs is to provide an easily recognised bit pattern, at which programs can be halted for the registers to be inspected.

9. Finally, a BRK operation, which has code 0, is executed. In a real 6502 chip this causes an address to be loaded off the stack, as if an interrupt had occurred. In the emulated chip used here, the program halts. You will observe that the status register has the B and I bits set. The program took 22 cycles to execute.

• Now press the Clear button. This resets all registers to 0. After pressing the Step button once, set the Carry bit by clicking on the C box above the Status register. As you step through the rest of the program, observe what effect this has had. The Carry bit has been added to the accumulator, so that when the result is finally stored, it is two larger than expected. The ASCII view of memory shows the value as l.

• Now press the Clear button. This time press the Run button. The program should proceed from instruction to instruction automatically until it reaches the BRK. You can modify the speed at which it executes using the slider at the left hand side of the window. Notice that the speed set there has no effect on the number of memory cycles that the program uses.

• Now press the Clear button, and then press the Trace button. As the program executes, whether step by step, or without stopping, the intermediate values of all registers along with the clock value are printed in the message area. We will not use this feature very often, but it may help you to debug your programs. You can scroll back through the message area to examine all the changes.

3.6.4 Three more examples

Previously you observed animations showing three simple programs being executed on a simulation of a 6502 processor. Now you are going to run these on the full 6502 on-line emulator.

Machine Code program 1 - adding 2 numbers

The first program you observed on the animation was:

CLC       (clear the carry flag, just in case)
LDA 2004  (fetch the number from location 2004)
ADC 2005  (add on the number from location 2005)
STA 2006  (store the result in location 2006)

For the full on-line emulator, we need to add some extra instructions and data to the program, but you will see these 4 essential instructions in the middle of the program.


Step 1: Use notepad, simple text or any other text editor to create the following text file:

        ORG 0
        CLC          ; clear the carry flag, just in case
        LDA FIRST    ; fetch the first number
        ADC SECOND   ; add the second number
        STA RESULT   ; store the result
        NOP
        BRK
        ORG .+10
FIRST   DC 37        ; the first number
SECOND  DC 54        ; the second number
RESULT  DS 1         ; reserve space for the result

Notes:

• use TAB to indent the column of assembly language instructions.

• ORG 0 means the first instruction of the program will be located at memory location 0.

• The program ends with NOP (no operation) and BRK (force interrupt).

• ORG .+10 means that the next line of code (which is the data we will be accessing) is to be stored 10 bytes further on in memory.

• Rather than LDA 2004, we use LDA FIRST - FIRST is a label.

• At the label FIRST, we store the data 37 (DC means Declare Constant).

• At the label SECOND (rather than at location 2005), we store 54.

• The result will be stored at label RESULT rather than in location 2006 (DS means Declare Space).

Step 2: Save this text file as add2nums.asm

Step 3: use the on-line assembler to convert this into machine code

Step 4: run the machine code program in the 6502 emulator (it will be useful to have the file add2nums.asm open in a window alongside the emulator, so you can follow the program more easily):


You should be able to follow the program execution step by step. Watch out for:

• The values in the accumulator changing
• The PC incrementing after each instruction is fetched
• The result being stored in memory at the end

Machine Code program 2 - multiplying a number by 3

The second program you observed on the animation was:

CLC       (clear the carry flag, just in case)
LDA 2004  (fetch the number from location 2004)
ADC 2004  (add the number to itself)
ADC 2004  (add the number on again)
STA 2005  (store the result in location 2005)

Again, for the full on-line emulator, we need to add some extra instructions and data to the program, but you will see the 5 above essential instructions in the middle of the program.


Step 1: Use notepad, simple text or any other text editor to create the following text file:

Notes:

• As before, the code is to be located at memory location 0, so we start with ORG 0.

• The program again ends with NOP (no operation) and BRK (force interrupt).

• Rather than loading the number from memory location 2004, we use LDA NUM - where NUM is a label for data stored 10 bytes after the end of the program.

• The result will be stored at label RESULT rather than in location 2005 (DS means Declare Space).

Step 2: Save this text file as multx3.asm

Step 3: use the on-line assembler to convert this into machine code

Step 4: run the machine code program in the 6502 emulator (once again, it will be useful to have the file multx3.asm open in a window alongside the emulator, so you can follow the program more easily)

You should be able to follow the program execution step by step. Watch out for:

• The values in the accumulator changing
• The PC incrementing after each instruction is fetched
• The result being stored in memory at the end

Machine Code program 3 - multiplying 2 numbers

The third program you observed on the animation was:

LDA 2004  (fetch the number from location 2004)
LDX 2005  (load the X register with the number from location 2005)
DEX       (reduce X by 1 (see note 1 below))
CLC       (clear the carry flag, just in case)
ADC 2004  (add the number from location 2004 to the current value in the accumulator)
DEX       (reduce the X register by 1)
BNE -6    (if X is not zero, branch back 6 locations (see note 2 below))
STA 2006  (store the result in location 2006)

Again, for the full on-line emulator, we need to add some extra instructions and data to the program, and write the BNE instruction slightly differently, but you will see all the above essential instructions within the program.

Step 1: Use notepad, simple text or any other text editor to create the following text file:

Notes:

• As before, the code is to be located at memory location 0, so we start with ORG 0.

• The program again ends with NOP (no operation) and BRK (force interrupt).

• Rather than loading the numbers from memory locations 2004 and 2005, we use LDA NUM and LDX TIMES to fetch the data from these labelled locations in memory - which, as before, are located 10 bytes after the end of the program.

• The result will be stored at label RESULT rather than in location 2006.

• BNE 7 is used because this points to the value of the PC for the ADC NUM instruction at the start of the loop - you can check it out as follows:

– CLC (1 byte instruction) at location 0


– LDA NUM (2 byte instruction) at locations 1 and 2
– LDX TIMES (2 byte instruction) at locations 3 and 4
– DEX (1 byte instruction) at location 5
– CLC (1 byte instruction) at location 6
– ADC NUM (2 byte instruction) at locations 7 and 8
– ...so the PC must point to location 7 to repeat the loop
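The same tally can be made by summing instruction lengths from ORG 0. A short Python check (written for illustration, using the byte counts listed above):

```python
# Byte-count tally for the start of mult2nums.asm (lengths as listed above).
program = [
    ('CLC',       1),
    ('LDA NUM',   2),
    ('LDX TIMES', 2),
    ('DEX',       1),
    ('CLC',       1),
    ('ADC NUM',   2),
]

address = 0
for mnemonic, length in program:
    print(f'{address}  {mnemonic}')   # the address each instruction starts at
    address += length
# ADC NUM comes out at address 7, which is why the program uses BNE 7.
```

Running the tally prints ADC NUM at address 7, confirming the branch target.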

Step 2: Save this text file as mult2nums.asm

Step 3: Use the on-line assembler to convert this into machine code

Step 4: Run the machine code program in the 6502 emulator (once again, it will be useful to have the file mult2nums.asm open in a window alongside the emulator, so you can follow the program more easily)

You should be able to follow the program execution step by step. Watch out for:

• The values in the accumulator changing

• The PC incrementing after each instruction is fetched

• The value in the X register decreasing by 1 each time around the loop

• The N and V flags being set when the value in the accumulator overflows into the leftmost bit

• The Z flag being set when X decrements to zero, so that the BNE branch back is not taken

• The result being stored in memory at the end

Step 5: Test the program by multiplying some other (small) numbers. To do this, you only need to alter the 2 lines at labels NUM and TIMES to put the numbers you want into memory. The program itself will not be any different.

You have now done enough machine code programming to satisfy the unit practical assessment requirement. Make sure your tutor records this on your practical skills checklist.
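The logic of the multiplication program can be modelled in ordinary Python. Here a plays the accumulator and x plays the X register; num and times stand in for the values at the labelled memory locations. This is a sketch of the loop's logic, not of the emulator itself:

```python
def multiply(num, times):
    """Multiply two small positive numbers by repeated addition,
    mirroring the 6502 loop: A starts at NUM, X counts down from TIMES."""
    a = num          # LDA NUM
    x = times        # LDX TIMES
    x -= 1           # DEX (A already holds one copy of NUM)
    while x != 0:    # BNE: loop while X is non-zero
        a = a + num  # CLC / ADC NUM
        x -= 1       # DEX
    return a         # STA RESULT

print(multiply(6, 4))  # 24
```

Tracing this by hand for small values is a good way to predict what you should see in the emulator's registers at each step.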

3.6.5 Exploring other machine code programs (optional)

If you have time, you can explore machine code programming further in two ways:

• Download and examine some of the sample programs provided (http://www.macs.hw.ac.uk/~pjbk/scholar/download2005.html)

• Try writing some simple assembly language programs of your own.


3.7 Summary

• instructions for a microprocessor are in machine code (binary number) form

• to make it easier for programmers, assembly languages are used as a symbolic representation of machine code

• a typical assembly language instruction has two parts: the op-code and the operand

• the op-code is a mnemonic (e.g. LDA) which represents the action to be taken

• the operand is the data to be operated on

• assembly language instructions can be classified into the following types:

– data transfer (e.g. LDA, TAX)
– arithmetic (e.g. INX, ADC)
– logical (e.g. CMP, AND)
– shift and rotate (e.g. ASL, ROR)
– branch (e.g. BCC, BNE)
– others (e.g. RTS, CLC)

• many instructions can be used with different addressing modes, such as

– immediate addressing, where the operand is the actual data (e.g. LDA #55)
– direct addressing, where the operand is the address of the data (e.g. LDA 3C15)
– implied addressing, where there is no operand (e.g. TXA)

• the 6502 processor is a simple processor which was commonly used for microcomputers in the 1980s, and is still in use in embedded systems today

• the 6502 processor has 3 general purpose registers: the accumulator and the X and Y registers

• the 6502 instruction set consists of around 100 different op-codes

• although modern computers use much more complex processors than the 6502, they work on the same basic principles

3.8 Revision Questions

Q12: Which of the following could be a machine code instruction?

a) 10010110 11001010
b) LDA 24A7
c) dasm fist.bin
d) FOR index = 1 TO 10 DO


Q13: Which of the following could be an assembly language instruction?

a) 10010110 11001010
b) LDA 24A7
c) dasm fist.bin
d) FOR index = 1 TO 10 DO

Q14: In the instruction ADC #56

a) ADC is the operand and #56 is the op-code
b) ADC #56 is the op-code, and there is no operand
c) ADC is the op-code and #56 is the operand
d) ADC #56 is the operand, and there is no op-code

Q15: The instruction TYA from the 6502 instruction set is an example of

a) a data transfer instruction
b) an arithmetic instruction
c) a shift instruction
d) a logical instruction

Q16: The instruction ROL from the 6502 instruction set is an example of

a) a branch instruction
b) an arithmetic instruction
c) a shift instruction
d) a logical instruction

Q17: The 6502 instruction STA 3C15 is an example of

a) implied addressing
b) immediate addressing
c) absolute addressing
d) direct addressing

Q18: In 6502 assembly language, to fetch the number 55 from memory location 2345, you could use

a) LDA #55
b) STA #55
c) STA 2345
d) LDA 2345


Q19:

The following is a section of assembly language program:

DELAY:  PUSH R1          'preserve R1
        LOAD R1,(PAUSE)  'load delay length into R1
NXT:    DEC R1           'count down 1
        CMP R1,0         'reached zero yet?
        JNZ NXT          'if not, go back to NXT

In the program above, "count down 1" is

a) an op-code
b) a label
c) a comment
d) a mnemonic



Topic 4

Processor Performance

Contents

4.1 Prior Knowledge and Revision
4.2 Introduction
4.3 RISC and CISC
    4.3.1 CISC Design
    4.3.2 RISC Design
    4.3.3 Review questions 1
4.4 SIMD
4.5 Pipelining
    4.5.1 The principles of pipelining
    4.5.2 Problems with pipelining
    4.5.3 Optimising the pipeline
4.6 Superscalar Processing
4.7 Summary
4.8 Revision Questions

Learning Objectives

After studying this topic, you should be able to:

• describe the key differences between RISC and CISC microprocessor design, including

– size of instruction set
– use of internal registers
– addressing modes

• explain how the use of SIMD instructions leads to a performance gain

• describe how pipelining operates and how it improves system performance

• describe possible problems with pipelining caused by branch instructions and instructions of different lengths

• describe how the following techniques can be used to optimise the instruction and data streams

– branch prediction
– data flow analysis
– speculative loading of data
– speculative execution of instructions
– predication

• describe how superscalar processing operates


4.1 Prior Knowledge and Revision

Before studying this topic, you should be able to:

• Describe and exemplify typical machine code or assembler instructions

• Describe the structure of a simple microprocessor such as the 6502

• Explain what is meant by the instruction set of a microprocessor

• Explain what is meant by addressing modes, and give examples of immediate, direct and implied addressing

Revision

Q1: A machine code instruction normally consists of

a) an operand followed by an op-code
b) an op-code followed by one or more operands
c) one or more op-codes followed by an operand
d) either an op-code or an operand

Q2: Which statement correctly describes the relationship between machine code and assembly language?

a) machine code and assembly language are the same
b) machine code is processor specific, but assembly language is universal
c) assembly language is a human representation of binary machine codes
d) machine codes are a human representation of assembly language

Q3: The instruction set of a microprocessor is

a) a description of its internal registers
b) the way that its control unit operates
c) a list of all its machine code instructions and addressing modes
d) the order in which it decodes and executes instructions

Q4: LDA 3C15 in 6502 assembler means "load the accumulator with the data to be found at memory location 3C15". This is an example of

a) immediate addressing
b) implied addressing
c) relative addressing
d) direct addressing

Q5: An example of implied addressing in 6502 assembler is

a) STA 2000
b) LDX #67
c) CMP 3C15
d) TXA


4.2 Introduction

So far, we have studied the architecture of a computer system, the internal architecture of a typical microprocessor, and the actual functioning of a simple microprocessor like the 6502. You have seen and used simple assembly language instructions, and seen how they allow data to be manipulated within and outside the processor. The 6502 was an 8-bit processor, with a 16-bit address bus. It proved to be very successful when introduced, but the limitations of an 8-bit processor soon had the developers working on new designs.

The step up to 16-bit general purpose processors happened in the late 1970s with the introduction of the Intel 8086. The Intel 8086/88 (1979) was the processor of choice for the IBM PC, which has led to the increasingly powerful and all-pervasive PC. Its design was based on the earlier Intel 8080 and 8085 processors. The processor had 16-bit data and address buses and allowed fetch and execute to be concurrent. This was accomplished by the use of a pre-fetch queue which could be filled during a fetch phase while an instruction was being executed. There were four 16-bit general-purpose registers and four 16-bit index registers. Four segment registers allowed addresses to be extended to 20 bits, giving access to a megabyte of memory in 64 Kbyte chunks. As a consequence, any data structure was limited to a maximum size of 64 Kbytes. Early processors ran at 4.7 MHz though later models increased this.

Intel has spawned from this early beginning the increasingly influential series of processors from the 80286, 80386, 80486 through to the Pentium I, II and III. These processors have developed from being 16-bit processors to 32-bit and ultimately 64-bit. This range of processors will be looked at later.

At the same time as Intel was developing the 8086, Motorola were developing the Motorola 68000 CPU (1979). This was a much more sophisticated CPU than the 8086 and also became the first in a family of processors, the 68020, 68040, etc.

The 8086 was designed with backward compatibility with earlier 8-bit designs in mind, while the 68000 was designed with an eye on producing a full 32-bit architecture in the future. In fact the 68000 was internally a full 32-bit architecture. Externally, however, it had a 16-bit data bus and a 24-bit address bus. Addresses were given as 32 bits but eight of these bits were not used. Because of the 24-bit address bus, 16 Mbytes could be directly addressed without the use of segment registers, and there was no 64 Kbyte restriction on the size of directly addressed data structures. All operations were full 32-bit operations taking place in sixteen 32-bit general-purpose registers, split into eight data registers and eight address registers. The initial processor ran at 8 MHz but later versions ran at up to 20 MHz. The whole processor was designed for expansion into a full 32-bit system with hardware floating-point to be incorporated at a later stage. It also could fetch the next instruction during execution. This architecture family will also be looked at in more detail later.

During the 1990s various other processors appeared. Among these was the PowerPC (Power Performance Chip) developed by IBM and Motorola (1992). The PowerPC processor range has been used in Apple's Power Macintosh and iMac ranges of computers since 1994. The PowerPC processor was one of the first superscalar architectures and incorporated both integer and floating-point pipelines with a 32 Kbyte cache on chip. It could execute three instructions at a time and the fetch unit could pre-fetch up to eight


instructions at a time from the cache. It had many RISC features including thirty-two 32-bit registers as well as thirty-two 64-bit floating point registers. Early processors ran at 50 and 66 MHz. More recent processors in the family have become faster and increasingly superscalar with the addition of more integer and floating-point pipelines. They are now full 64-bit architectures with 64-bit registers and data paths. Cache sizes have increased with a 256 Kbyte level 1 cache and a 1 Mbyte level 2 cache incorporated on the chip. We will take a look at four technological developments (three of which are mentioned above) which have led to improvements in processor performance. These are

• RISC processors

• SIMD

• pipelining

• superscalar processing

4.3 RISC and CISC

As new features were added to the design of the processor, the instruction sets of the processors were expanded to take advantage of these new features. This is usually done in such a way as to preserve backward compatibility: no instruction is removed. This means that programs written for earlier models of the processor will still run on the more recent processors. Thus an Intel Pentium processor with 32-bit registers and 32-bit addresses can still run instructions written for the 8086 with its 16-bit addresses and use of segment registers, even though programs written for the Pentium itself make no use of the segment registers.

Among the reasons for the increase in features and complexity in processors has been the increase in the power and sophistication of high-level programming languages. As well as allowing the concise description of complex algorithms, they also support the use of structured programming and object-oriented techniques. This has led to a steadily widening gap between the operations carried out in high-level programming languages and the operations available in the processors on which they are executed. This gap between language operations and the operations available in processors has led computer architecture designers to:

1. increase the size of instruction sets for microprocessors by providing many more operations;

2. design ever more complex instructions;

3. provide more addressing modes;

4. implement some high-level language constructs in machine instruction sets.


4.3.1 CISC Design

This philosophy of attempting to keep pace with high-level language design by designing increasingly complex processors with ever more complex instruction sets and addressing modes is still followed. Processor families that follow this philosophy have been called Complex Instruction Set Computers (CISC). A typical CISC range of computers is the family started by the Intel 8086 that has developed through the 80286, 80386, 80486 and into the Pentium series.

A CISC-type processor is very difficult and expensive to design and build if the logic for each complex instruction has to be hard-wired into the control unit. An approach that has been used to solve this problem is a technique called microprogramming. In this approach each complex instruction is split into a series of simpler instructions called microinstructions. These microinstructions are directly executable by the processor. When a complex instruction is to be executed, the CPU actually executes a corresponding small microprogram stored in a control memory. This simplifies the design of the processor and allows the addition of new complex instructions by storing the appropriate microprogram in control memory.
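The microprogramming idea can be sketched in Python. The mnemonics and micro-operation names below are invented purely for illustration: each complex op-code is looked up in a "control memory" and expanded into the sequence of simple micro-instructions that the hardware executes directly.

```python
# Hypothetical control memory: each complex instruction maps to a
# microprogram made up of simple micro-instructions.
CONTROL_MEMORY = {
    # An imagined "add memory to memory" instruction, broken into
    # register-level steps that the hardware can execute directly.
    "ADDM": ["LOAD_TEMP1", "LOAD_TEMP2", "ALU_ADD", "STORE_RESULT"],
    # An imagined "increment memory" instruction.
    "INCM": ["LOAD_TEMP1", "ALU_INC", "STORE_RESULT"],
}

def execute(instruction):
    """Execute one complex instruction by running its microprogram."""
    micro_ops = CONTROL_MEMORY[instruction]
    for op in micro_ops:
        pass  # each micro-op would drive the CPU's internal control lines
    return micro_ops

print(execute("ADDM"))
```

Adding a new complex instruction then amounts to adding another entry to the control memory, rather than redesigning the hard-wired control unit.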

4.3.2 RISC Design

Another approach to processor design is also possible. Much research has been done to try to determine the pattern of execution of machine language instructions generated by high-level language programs. This research has shown that:

1. The most common statement is the assignment statement. That is, data movement operations predominate.

2. Conditional statements, whether in the form of IF statements or as tests for completion at the end of loops are very common.

3. Most of the references to variables are to simple scalar variables, for example integers and floating point values. In addition these references are local to the currently executing portion of code.

As a result, research has been done since about 1980 into an entirely different approach to computer architecture. This approach has attempted to make the architecture simpler. The first thing done was to reduce the number of instructions and to make them all have the same format, as far as possible. This makes the design of the control unit much simpler. The studies of operand use also indicated that it would be worthwhile to try to reduce the number of memory accesses made by instructions. This can be achieved by increasing the number of registers so that much-used operands could be available inside the CPU and would not have to be continually stored to and loaded from memory. Another method of achieving simplicity was to reduce the number of addressing modes. Also because of the many loop and if statements it was considered that the sequence control mechanisms should be optimised to allow efficient pipelining of instructions. This has led to the development of the type of architecture that is called the Reduced Instruction Set Computer, commonly called RISC processors as distinct from CISC processors.


The characteristics that most RISC processors share are:

• a large number of general-purpose registers;

• a small number of simple instructions that mostly have the same format;

• a minimal number of addressing modes;

• optimisation of the instruction pipeline.

Table 4.1 illustrates the difference between a well-known CISC processor and a well- known RISC processor.

Table 4.1: The difference between RISC and CISC

                                   CISC processor    RISC processor
                                   Intel 80486       Sun SPARC
Year developed                     1989              1987
No. of instructions                235               69
Instruction size (bytes)           1-11              4
Addressing modes                   11                1
No. of general purpose registers   8                 40-520

The microprogram approach to CISC architecture has the advantage that complex circuitry is not required to implement complex instructions, and also allows the customisation or addition of new instructions by additions to the microprogram. However, the proponents of the RISC architecture say that these benefits do not outweigh the cost of running the microprogram. They argue that it is better to design a simple processor with a small well-designed instruction set which is directly executed by the CPU. One consequence of this is that programs for RISC processors tend to be longer.

There has been no definite conclusion in the debate between the respective merits of RISC and CISC architectures, and both processor types are available commercially. In fact the Intel Pentium II is a leading CISC processor but actually includes some RISC features, while the IBM/Motorola PowerPC processor is a RISC processor with some CISC features.

Characteristics of RISC Processors (10 min)

On the Web is an activity that asks you to identify the characteristics of RISC processors. You should now do this activity.

4.3.3 Review questions 1

Q6: How do processor designers ensure that software written for early processors in a family of processors will still execute on later members of the family?

Q7: Which of the following features of a processor would you identify with a RISC processor?

1. Many general-purpose registers.
2. A very wide systems bus.
3. Several addressing modes.


4. Only one format of instruction.
5. A large amount of memory.
6. A small instruction set.

4.4 SIMD

A standard machine code instruction operates on a single data item. It could be moving the data from one place to another, or carrying out some arithmetical or logical operation on it. Processors are designed to execute instructions very rapidly one after another.

As computers developed, more and more use was made of graphics. To manipulate graphics, it is often necessary to apply the same instruction to a large number of data items. For example, to change the background colour of an image, the processor must access each of the bytes of data in the screen memory (bit map), change its value, and return it to the screen memory. The same instruction must be applied to each byte, and there could be thousands of these.

To speed up this type of process, chip designers developed new architectures with large multi-byte registers, and added instructions to the instruction set which could simultaneously operate on many of these registers. This type of instruction is known as SIMD (single instruction multiple data). SIMD instructions have long been available on mainframe computer systems, but have only more recently become available on microcomputer systems. The first implementation of SIMD on microcomputers was the MMX instruction set introduced on the Intel Pentium chip in 1996. The new MMX instructions were able to carry out a range of actions, including:

• addition

• subtraction

• multiplication

• comparison

• pack and unpack (split or combine data items)

• logical operations (AND, OR, NOT)

• shift

• data transfer

to any combination of

• 8 single bytes

• 4 double bytes

• 2 quadruple bytes

all held in a single 64 bit register.
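The effect of an SIMD addition on packed data can be imitated in ordinary Python. The function below adds eight single-byte lanes held in one 64-bit value, in the style of MMX packed byte addition; each lane wraps around modulo 256, independently of its neighbours. This is a sketch of the idea, not real MMX code:

```python
def packed_add_8x8(a, b):
    """Add two 64-bit values as eight independent 8-bit lanes."""
    result = 0
    for lane in range(8):
        shift = lane * 8
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        result |= ((x + y) & 0xFF) << shift   # carry never crosses a lane
    return result

# Brighten eight pixels at once: add 0x10 to every byte.
pixels = 0x1020304050607080
print(hex(packed_add_8x8(pixels, 0x1010101010101010)))  # 0x2030405060708090
```

On a real SIMD processor this whole loop is one instruction on one register, which is where the performance gain comes from.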


The Pentium III chip extended this by introducing eight 128 bit registers which could be operated on by the SIMD instructions. The Motorola PowerPC 7400 chips used in Apple G4 computers also provide SIMD instructions, which can operate on multiple data items held in 32 128-bit registers. SIMD instructions have a huge impact on the processing of multimedia data, and this was the main selling point of the MMX upgrade to the Pentium chips. However, SIMD also improves performance on any type of processing which requires the same instruction to be applied to multiple data items. Other examples which benefit from SIMD include voice-to-text processing and data encryption and decryption.

4.5 Pipelining

4.5.1 The principles of pipelining

The idea of instruction pipelining derives from the basic idea of an assembly line, where product assembly is split into a sequence of stages. By executing these stages in turn along an assembly line, many products at different stages are being worked on simultaneously. Thus, products can be accepted on to the assembly line before previously submitted products are finished.

This idea can be applied to the fetch-execute cycle in a processor. An instruction goes through several stages, and once the circuitry for one stage has finished its job, this circuitry is idle until the next instruction appears. Obviously if processing of the next instruction can be commenced before the previous instruction has completed its journey through the processor, then there is a possibility of increasing the throughput of the processor.

For example, in a pipeline we could have three stages: fetch, decode and execute. Without a pipeline, the instructions would be treated in sequence, and the next instruction could not be dealt with at all until the previous instruction had finished; see Figure 4.1. With a pipeline, several instructions can be dealt with simultaneously, each only having to wait until one stage of an instruction has been dealt with before moving on to the next; see Figure 4.2. If we assume that each of these stages takes a clock cycle, then without a pipeline an instruction is executed every three cycles; with the pipeline an instruction still takes three cycles, but one instruction is completed each cycle, giving a three-fold increase in speed.

[Figure: three instructions each pass through fetch, decode and execute strictly one after another; the next instruction cannot start until the previous one has finished.]

Figure 4.1: Execution of Instructions Without a Pipeline


[Figure: three overlapping instructions; while one instruction is being executed, the next is being decoded and a third is being fetched. The number of stages is the depth of the pipeline.]

Figure 4.2: Execution of Instructions with a Three Stage Pipeline

In fact this basic idea was applied in a simple form in the Intel 8086 processor. This processor had a 6-byte pre-fetch queue and, since instructions were 1-4 bytes in length, one or more instructions could reside in this queue. The fetch-execute cycle was split into a fetch phase and an execute phase which were independent of each other. While one instruction was executing, another instruction was being loaded into the pre-fetch queue. When the execution unit became free, the next instruction was loaded into it from the pre-fetch queue, and the fetch unit would read the next instruction from memory and place it in the pre-fetch queue. If the phases were exactly equal in execution time, this gave the possibility of a two-fold increase in performance. In practice, execution of an instruction takes longer than the fetch phase, so a doubling in execution speed is not achieved.

The largest gain comes when phases take almost equal time. This has been achieved by splitting the fetch-execute cycle into a larger number of smaller phases which are comparable to each other in time of execution. For example, the Intel 80386 increased performance by incorporating a pipeline that had six sections which could all operate in parallel and had a 16-byte pre-fetch queue. The pipeline for the fetch-execute cycle was split as follows:

¯ Fetch instruction.

¯ Decode instruction.

¯ Calculate effective addresses (EA) of operands.

¯ Fetch operands.

¯ Execute instruction.

¯ Write result.

Table 4.2 shows the sequence of events in time as a program is executed. I1 is the 1st instruction, I2 is the next, and so on.


Table 4.2: Sequence of Events in a Pipeline

Time  Fetch  Decode  Calculate EA  Fetch Operands  Execute  Write
  1    I1
  2    I2     I1
  3    I3     I2      I1
  4    I4     I3      I2            I1
  5    I5     I4      I3            I2              I1
  6    I6     I5      I4            I3              I2       I1
  7    I7     I6      I5            I4              I3       I2

If each stage in the pipeline takes only one clock cycle then, as soon as the pipeline becomes full, one instruction finishes execution every clock cycle. In this example, the pipeline fills at time 6. Theoretically a sixfold increase in speed is possible.

In practice, this gain in performance will not be completely achievable. It is unlikely that each stage will take exactly the same time, and the pipeline can only move as fast as the slowest instruction, so there will be some waiting between one stage and the next. Some instructions will also depend upon information from previous instructions, and this may also hold up the process.

There is an online animation showing a 3 stage pipeline at this point.

Pipelining is a feature of all modern processors. Although it requires more complex circuitry on the processor chip, the performance gains outweigh any disadvantages. Before pipelining was incorporated into processor designs, instructions would typically take many clock cycles to execute. With pipelining, speeds of 1 instruction per clock cycle are possible.
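The timing argument above is easy to state as a formula: with k one-cycle stages and n instructions, a non-pipelined processor needs n × k cycles, while the pipeline needs k cycles to fill plus one cycle for each remaining instruction. A sketch, assuming no stalls:

```python
def cycles_without_pipeline(n, k):
    """n instructions, k one-cycle stages, executed strictly in sequence."""
    return n * k

def cycles_with_pipeline(n, k):
    """k cycles to fill the pipeline, then one instruction completes per cycle."""
    return k + (n - 1)

# The six-stage pipeline of Table 4.2: the pipeline fills at time 6,
# and the speed-up approaches 6 for long instruction streams.
print(cycles_without_pipeline(100, 6))  # 600
print(cycles_with_pipeline(100, 6))     # 105
```

For 100 instructions the six-stage pipeline gives a speed-up of 600/105, a little under the theoretical sixfold maximum; the longer the instruction stream, the closer the speed-up gets to the number of stages.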

4.5.2 Problems with pipelining

Pipelining works best when all instructions are of the same length, and follow in a direct sequence, one after another. Unfortunately, this is not always the case.

In some processors, especially CISC-based designs, instructions can vary greatly in length. Some instructions may be able to be executed in a few clock cycles, while others may require 10, 20 or more clock cycles. This means that the pipeline cannot run smoothly: a long, slow instruction may hold up the pipeline. This is less of a problem with RISC-based processor designs, as most of the instructions are fairly short.

Another problem with pipelining arises when one instruction depends on a result produced by the previous instruction. Data required for the 2nd instruction may not yet be available, because the 1st instruction is still being executed. This is called true data dependency. The pipeline must be stalled until the data is ready for the 2nd instruction. In an attempt to overcome this difficulty, compilers for processors which use pipelining can re-order the instructions where dependency is discovered, so that dependent instructions are separated by other non-dependent instructions.

The most serious problem which can arise with pipelining results from branch instructions. If a conditional branch instruction, e.g. BCC 25, is encountered, the next instruction to be carried out may be either


• the next instruction in the program (if the carry flag is set in this example), or

• the instruction 25 bytes ahead (if the carry flag is clear in this example).

The first alternative is no problem, because the next instruction will be following the BCC 25 instruction through the pipeline. However, if the branch is taken, all the instructions following BCC 25 through the pipeline will have to be discarded, and the pipeline will need to start again. This will mean several wasted clock cycles.

The following two diagrams illustrate this for a set of instructions passing through a 6 stage pipeline. In Figure 4.3, instruction 3 is a branch instruction, which requires a jump to instruction 15. In this case, 4 clock cycles are wasted as the pipeline is "flushed". (The six stages in this pipeline are:

• FI - fetch instruction

• DI - decode instruction

• CO - calculate operand

• FO - fetch operand

• EI - execute instruction

• WO - write output)


[Figure 4.3: Pipeline timing with no branches and with a conditional branch. With no branches, instructions I1, I2, I3, ... flow through the six stages (FI, DI, CO, FO, EI, WO) and one instruction completes every cycle from time 6 onwards. With a conditional branch, I3 branches to I15: the partially processed instructions I4 to I7 are discarded when I3 completes at time 8, and the pipeline restarts with I15, wasting 4 clock cycles.]

The flowchart below shows the logic required by the processor to ensure branches are handled correctly in the pipeline.


[Figure 4.4: Flowchart of branch handling in a six-stage pipeline. Each instruction passes through Fetch Instruction (FI), Decode Instruction (DI) and Calculate Operands (CO); if it is an unconditional branch, the PC is updated and the pipe is emptied at that point. Otherwise it continues through Fetch Operand (FO), Execute Instruction (EI) and Write Operands (WO), after which a taken branch or an interrupt also updates the PC and empties the pipe.]

Various techniques have been developed to overcome pipelining problems. These are discussed in the next sub-topic.

Q8: Name 3 types of problems that can arise during pipelining.

4.5.3 Optimising the pipeline

A number of techniques have been developed to overcome the problems associated with pipelining. These techniques include

• branch prediction

• data flow analysis

• speculative loading of data

• speculative execution of instructions

• predication


Each of these is a complex technique, and we will only take a superficial look at how they operate. Branch Prediction The Motorola 68020 and VAX11/780 chips assume that the conditional branch will not be taken. Research shows that this is more often incorrect! It would seem sensible, therefore, to assume that the branch will be taken, and to begin fetching from the branch target. However, this may slow down execution as it may be in a different memory "page" or not in cache. Neither of these approaches is satisfactory. Some processors predict branch "taken" for some op-codes and "not taken" for others. The most effective approaches, however, use dynamic techniques. Many branch instructions are repeated often in a program (e.g. the branch instruction at the end of a loop). The processor can then note whether or not the branch was "taken" previously, and assume that the same will happen this time. This requires the use of a branch history table, a small area of cache memory, to record the information. This method is used in the AMD29000 processor. Data Flow Analysis Data flow analysis is used to overcome dependency. The processor analyses the instructions for any dependency, then allocates instructions to the pipeline in an order which prevents dependency stalling the flow of instructions through the pipeline. Speculative loading of data One interesting technique used in the Intel / HP IA64 architecture is called speculative loading. Essentially, the processor looks ahead, and executes early any instructions which load data from memory. The data is stored in registers ready to be used by later instructions if required. If the program branches and the data is not required, then it can be discarded. Speculative execution This technique involves the processor in carrying out instructions before they are required. The results of these instructions are stored in temporary registers. If they turn out not to be required, again they can be discarded. 
Predication
Predication tackles conditional branches by executing instructions from both branches until the processor knows which branch is to be taken. Each branch uses a different set of registers within the processor, to avoid confusion. Once the processor knows which branch is taken, it can discard the results of the "not taken" branch and proceed with the "taken" branch.

All of these techniques are inter-related, and have only become possible because of

• the increasing clock speeds

• the increasing complexity of chip designs

• the increasing number of registers available in modern processors.

None of this would be possible in a simple processor such as the 6502.


4.6 Superscalar Processing

As we have seen, the introduction of pipelining has led to increases in performance. The natural next step was to introduce more than one pipeline within the processor. A processor with more than one pipeline, whose pipelines can operate independently of each other, is called a superscalar processor. Superscalar processors try to take advantage of instruction-level parallelism, which is the degree to which the instructions in a program can be executed in parallel. For example, if two instructions in a program were:

a = a * 2

b = b * 3

then, if two pipelines are available, these instructions could be executed in parallel, giving a twofold increase in speed, as there is no data dependency.

In practice, this speed-up is often not attainable because of data dependencies. For example, if the second assignment above was b = a * 3, then it could not be executed until the first statement had produced a value for a. Various techniques are used to ensure that all the pipelines are kept full and operative for a high percentage of the time. Such techniques involve executing instructions out of order, trying to predict which decision will be taken at branches, and speculatively executing instructions ahead of the program counter in the hope that the results can be used later. In the Intel family of processors, the first superscalar processor was the Pentium, which incorporated a second integer arithmetic pipeline on the chip. As in the preceding Intel 80486DX processor, a floating-point pipeline was provided on the chip; earlier Intel processors carried out floating-point arithmetic by using a separate off-chip floating-point co-processor. The advent of superscalar processors has allowed more than one instruction to be executed per clock cycle. For example, the Pentium Pro processor can execute three instructions per clock cycle. It has two floating-point pipelines, two integer pipelines, and an L2 cache closely coupled to the processor by a 64-bit cache bus which can support four concurrent accesses to the cache. Figure 4.5 shows a comparison of the performance gains provided by a 4-stage pipeline and a superscalar architecture. In each case, it is assumed that there are no pipeline stalls due to branching or dependencies.
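The dependency check a superscalar dispatcher must make can be sketched as follows. The (destination, sources) encoding of an instruction is an invented simplification of what real issue logic examines.

```python
# A sketch of the data-flow check a superscalar dispatcher performs: two
# instructions may issue together only if neither writes a register the
# other reads or writes. The (destination, sources) encoding is an
# invented simplification of what real issue logic examines.

def independent(i1, i2):
    dest1, srcs1 = i1
    dest2, srcs2 = i2
    raw = dest1 in srcs2     # read-after-write: i2 needs i1's result
    war = dest2 in srcs1     # write-after-read
    waw = dest1 == dest2     # write-after-write
    return not (raw or war or waw)

# a = a * 2 ; b = b * 3  -> no shared registers: can issue in parallel
assert independent(("a", {"a"}), ("b", {"b"}))
# a = a * 2 ; b = a * 3  -> b's new value depends on the new a: must wait
assert not independent(("a", {"a"}), ("b", {"a"}))
```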


Figure 4.5: Comparison of the performance gains from a 4-stage pipeline and a superscalar architecture (no stalls assumed)
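The comparison in Figure 4.5 can be reproduced with a back-of-envelope calculation, assuming an ideal pipeline with no stalls and one clock cycle per stage; the stage count and instruction count below are illustrative.

```python
import math

# Back-of-envelope version of the comparison in Figure 4.5, assuming an
# ideal pipeline: no stalls, one clock cycle per stage.

def cycles(n_instructions, stages, pipelines=1):
    # The first instruction(s) take `stages` cycles to fill the pipeline;
    # after that, each pipeline completes one instruction per cycle.
    remaining = max(n_instructions - pipelines, 0)
    return stages + math.ceil(remaining / pipelines)

unpipelined = 100 * 4                              # 4 cycles each, serially: 400
pipelined = cycles(100, stages=4)                  # 4 + 99 = 103 cycles
superscalar = cycles(100, stages=4, pipelines=2)   # 4 + 49 = 53 cycles
```

With 100 instructions, the single pipeline approaches a fourfold speed-up over the unpipelined case, and a second pipeline roughly doubles the throughput again.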

4.7 Summary

• Four techniques developed to improve processor performance are
  – RISC (reduced instruction set) processors
  – SIMD
  – pipelining
  – superscalar processing

• Comparing RISC and CISC microprocessor designs, RISC processors
  – have a smaller instruction set
  – have more internal registers
  – have fewer addressing modes

• SIMD (single instruction, multiple data) instructions lead to a performance gain, as many items of data can be operated on at the same time

• SIMD is particularly useful in multimedia processing

• Pipelining improves performance by carrying out different stages of the fetch-execute cycle at the same time on different instructions, like an assembly line in a factory

• Processors which use pipelining can achieve a throughput of 1 instruction per clock cycle, even though each individual instruction may take several clock cycles to complete

• All modern processors use pipelining, with 5 or more stages

• A pipeline might stall on a conditional branch instruction, as execution may jump to an instruction other than the next one in the pipeline

• Techniques used to optimise the instruction and data streams include
  – branch prediction
  – data flow analysis
  – speculative loading of data
  – speculative execution of instructions
  – predication

• Superscalar processing involves a processor having multiple pipelines

4.8 Revision Questions

Q9: RISC processors

a) have a larger instruction set than CISC processors, so improve performance
b) have a smaller, but more efficiently executed, instruction set than CISC processors
c) have a reduced number of internal registers compared with CISC processors
d) have a greater number of addressing modes than CISC processors

Q10: SIMD processing

a) applies many instructions to a single item of data simultaneously
b) only improves performance in multimedia processing
c) applies the same instruction to many items of data simultaneously
d) only gives a performance gain if there are no conditional branch instructions

Q11: Pipelining

a) is a system which uses multiple processors to improve performance
b) involves a processor dealing with several instructions simultaneously in stages
c) is a technique only used on specialised high performance computers
d) always results in the execution of 1 instruction per clock cycle

Q12: A branch instruction may cause the pipeline to stall because

a) it may result in a jump to an instruction which is not next in the pipeline
b) branch instructions are of a different length to other instructions
c) the pipeline cannot break a branch instruction into enough stages
d) branch instructions take more than 1 clock cycle to execute

Q13: A branch history table is required for

a) speculative loading of data
b) predication
c) branch prediction
d) superscalar processing

Q14: Superscalar processing

a) is only possible for RISC-based processors
b) is a technique used to avoid problems with branch instructions in a pipeline
c) involves a processor being able to operate two or more pipelines simultaneously
d) is only useful to improve the processing of multimedia instructions


Topic 5

Processor Development

Contents

5.1 Prior Knowledge and Revision
5.2 Introduction
5.3 The Intel X86 Series
  5.3.1 The 8086 processor
  5.3.2 From 80286 to 80486
  5.3.3 The Pentium range
  5.3.4 X86 series evolution
  5.3.5 X86 addressing modes and instruction set
  5.3.6 Review Questions on X86 series
5.4 The PowerPC series
  5.4.1 PowerPC background
  5.4.2 PowerPC overview
  5.4.3 PowerPC series evolution
  5.4.4 Review Questions on PowerPC series
5.5 The Intel IA-64
  5.5.1 IA-64 background
  5.5.2 VLIW
  5.5.3 The Itanium Processor
  5.5.4 Questions on the IA-64 Architecture
5.6 Parallel Computing
  5.6.1 Introduction
  5.6.2 Multiprocessing, mainframes and supercomputers
  5.6.3 Massively Parallel Architectures
5.7 Summary
5.8 Revision Questions

Learning Objectives

After studying this topic, you should be able to:

• describe the evolution of the Intel X86 series of microprocessors

• describe the evolution of the PowerPC series of microprocessors

• describe the evolution of the IA-64 microprocessor

• relate the following features and techniques to the evolution of these microprocessor architectures
  – clock speeds
  – data bus widths
  – pipelining and related techniques
  – superscalar processing
  – number and function of registers
  – RISC and CISC

• describe how these developments have improved system performance

• describe how parallel computers function, including use of local memory, pipelining, local pathways and communication between processors

• describe the performance benefits of parallel computers


5.1 Prior Knowledge and Revision

Before studying this topic, you should be able to

• describe the structure of a simple microprocessor

• explain what is meant by the instruction set of a microprocessor

• explain what is meant by addressing modes

• describe the key difference between RISC and CISC designs

• explain how pipelining operates

• explain what is meant by superscalar processing

Revision

Q1: A CISC-based microprocessor

a) has a smaller instruction set than a RISC-based microprocessor
b) has more addressing modes than a RISC-based microprocessor
c) has fewer addressing modes than a RISC-based microprocessor
d) has more internal registers than a RISC-based microprocessor

Q2: A processor is able to operate several instruction pipelines simultaneously. It must be

a) a RISC processor
b) a CISC processor
c) a superscalar processor
d) an SIMD processor

Q3: Immediate, absolute and implied are examples of

a) branch prediction
b) instruction sets
c) addressing modes
d) speculation

Q4: Pipelining improves performance because

a) it uses multiple processors
b) it allows processors to run at higher clock speeds
c) it uses a wider data bus
d) it splits instructions into stages, which allows more than one instruction to be processed at the same time

5.2 Introduction

We have studied the general architecture of microprocessors, using the 6502 as an example. Although the 6502 is still used in some embedded systems, it is no longer


relevant to personal computers. Since the 1980s, microprocessor architecture has developed rapidly, as a result of

• the increasing miniaturisation of microelectronic circuitry, which means that ever more complex chip designs have become possible and economically viable

• pressure from software developers to design microprocessors with ever-increasing performance

We have considered some of the design features which have been developed and incorporated into modern microprocessors. These changes have been evolutionary in nature, so it is possible to follow developments by considering the series of microprocessors from two of the major manufacturers, Intel and Motorola. These have formed the basis of the Windows PC and Apple Macintosh ranges of computers respectively. Although the latest versions of these microprocessor series are hugely more complex than the humble 6502, the basic principles that we have studied still apply.

Before we study each of these series, we will start by taking a look at the early history of processor design, from the Intel 4004 in 1971 through to the 6502 in 1975. The first microprocessors were not general-purpose processors but were designed for specific applications. One of the earliest of these was the Intel 4004.

The Intel 4004 (1971) was a 4-bit processor with an accumulator and a 4-bit parallel adder. The processor chip was designed as part of a family of four chips which could be built into calculators, data terminals, numeric control systems, and so on. While it processed data in 4 bits, its instructions were 8 bits long. Program and data memory were kept separate: there were 1 Kbyte of data memory and 4 Kbytes of instruction memory, accessed by a 12-bit Program Counter. It had sixteen 4-bit general-purpose registers which could be treated as eight 8-bit registers. It ran at a clock rate of 740 kHz and needed eight clock cycles for a complete cycle through the CPU, giving an instruction time of 10.8 microseconds. There were about 45 instructions in its instruction set.

The Intel 4004 was superseded by a faster and more complex 8-bit processor, the Intel 8008, in 1972. The next major step was the introduction of the Intel 8080, which was designed as a complete CPU for a general-purpose computer.
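The 4004 timing quoted above can be checked with a one-line calculation: eight clock cycles at 740 kHz come to roughly 10.8 microseconds per instruction.

```python
# Checking the quoted Intel 4004 timing: eight clock cycles per
# instruction at a 740 kHz clock.

clock_hz = 740_000
cycles_per_instruction = 8
instruction_time_us = cycles_per_instruction / clock_hz * 1_000_000
# about 10.8 microseconds, as stated in the text
```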
The Intel 8080 (1974) had a 16-bit address bus and an 8-bit data bus, and was basically an 8-bit processor. The Program Counter was sixteen bits long and there were seven 8-bit general-purpose registers, some of which could be used in pairs to give 16-bit registers. This processor was used in what is sometimes called the ‘first personal computer’, the Altair 8800.

Various other 8-bit processors were popular at this time; these included the Zilog Z-80 and the MOS Technology 6502. In 1976 the Zilog Corporation, which was formed by ex-Intel engineers, produced the Zilog Z-80 processor. This processor was an improved version of the Intel 8080: it could accept all the 8080 instructions but also had about 80 more of its own. It also had an 8-bit data bus and a 16-bit address bus. The register set was extended to sixteen 8-bit general-purpose registers divided into two sets, only one of which could be used at a time. Each set had its own designated Accumulator and Flag Register, and the operating system could switch between register sets very quickly. Pairs of registers could hold 16-bit memory addresses or 16-bit operands for double-precision calculation. There were also two Index Registers. Initial processors ran at 2.5 MHz, but some later designs ran at speeds as high as 10 MHz. The Z-80 was used in many small computers aimed at the business market.

One of the first processors to be used in home computers was the 6502 (1975), which was designed by Motorola engineers who left to form a company called MOS Technologies (later bought by Commodore). The 6502 was an 8-bit processor with an 8-bit data bus and a 16-bit address bus. It had a minimal register set: an 8-bit Accumulator, two 8-bit Index Registers, an 8-bit Stack Pointer and an 8-bit Status Register. The Program Counter was sixteen bits. Consequently this processor was relatively cheap, and it was used in early 8-bit home computers such as the Commodore PET, the Apple II and the Atari.

The step up to 16-bit general-purpose processors happened in the late 1970s with the introduction of the Intel 8086.

5.3 The Intel X86 Series

The Intel 8086/88 (1979) was the processor chosen for the IBM PC, which has led to the increasingly powerful and all-pervasive PC. Its design was based on the earlier Intel 8080 and 8085 processors. The processor had 16-bit data and address buses and allowed fetch and execute to be concurrent. This was accomplished by the use of a pre-fetch queue, which could be filled during a fetch phase while an instruction was being executed. There were four 16-bit general-purpose registers and four 16-bit Index Registers. Four Segment Registers allowed addresses to be extended to 20 bits, giving access to a megabyte of memory in 64 Kbyte chunks; as a consequence, any data structure was limited to a maximum size of 64 Kbytes. Early processors ran at 4.7 MHz, though later models increased this.

From this early beginning Intel has spawned the increasingly influential series of processors from the 80286, 80386 and 80486 through to the Pentium I, II and III. These processors have developed from 16-bit processors to 32-bit and ultimately 64-bit ones.
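The 8086's segmented addressing can be sketched in a few lines: a 20-bit physical address is formed by shifting the 16-bit segment value left by four bits and adding the 16-bit offset.

```python
# The 8086 address calculation: a 16-bit segment register, shifted left
# four bits, plus a 16-bit offset, yields a 20-bit physical address.

def physical_address(segment, offset):
    assert 0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF
    return (segment << 4) + offset

# A segment starts on a 16-byte boundary and spans at most 64 Kbytes.
assert physical_address(0x1000, 0x0000) == 0x10000
# Different segment:offset pairs can address the same physical byte.
assert physical_address(0x1234, 0x0010) == physical_address(0x1235, 0x0000)
```

Because the offset is only 16 bits, no single segment can span more than 64 Kbytes, which is exactly the data-structure limit described above.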


5.3.1 The 8086 processor

Table 5.1 lists the main features of the Intel 8086 processor. This processor has had a great influence on the Intel range of processors, and an extensive (though not always good!) influence on the growth of the personal computer.

Table 5.1: Features of the Intel 8086 processor

Processor: 8086
Type: 16-bit
Year introduced: 1979
Data/Address bus width: 16/16 bits
Addressable memory: 1 Mbyte, by use of a 20-bit address computed from a Segment Register and a 16-bit address
System bus speed (MHz): 4.7-10
Internal bus speed (MHz): 4.7-10
Registers: 4 GP registers, 5 Pointer and Index Registers, 4 Segment Registers and 1 Flags Register. All 16 bits wide.
Cache: None
Features: 4 extra address bits, available on 4 extra pins on the chip, were generated by the use of four 16-bit segment registers. Each segment could hold only 64 Kbytes, thus limiting any data structure to this size. A 6-byte pre-fetch buffer was used to allow concurrent fetch and execute.

As personal computers and their operating systems advanced in sophistication, the 8086's 1 Mbyte memory limitation became a major drawback. The memory demands of individual programs were increasing, and multi-tasking operating systems were being introduced that would allow several programs to execute simultaneously. This required a larger physical address space and also required programs to use a far larger virtual address space. Protection mechanisms were also required, to ensure that one program could not affect the code or data of another. The Intel 80286 processor was designed to meet these needs.

5.3.2 From 80286 to 80486

The segment registers now pointed to tables of 24-bit descriptors, which held the base addresses of segments. Physical addresses were then calculated as the 24-bit base address plus the operand address within the segment. This allowed the physical addressing of 16 Mbytes of memory, though segments were still restricted to 64 Kbytes. To cope with these larger addresses, the address bus was extended to 24 bits. Various protection mechanisms were also incorporated.


Table 5.2: Main features of the 80286 processor

Processor: 80286
Type: 16-bit
Year introduced: 1982
Data/Address bus width: 16/24 bits
Addressable memory: 16 Mbytes, by use of 24-bit addresses. The Segment Registers were used to access descriptor tables which held 24-bit base addresses, to which the operand address was added.
System bus speed (MHz): 6-20
Internal bus speed (MHz): 6-20
Registers: 4 GP registers, 5 Pointer and Index Registers, 4 Segment Registers and 1 Flags Register. All 16 bits wide.
Cache: None
Features: Increased addressable memory because of the extra width of the address bus. The use of descriptor tables also made it easier for the system to implement virtual memory, which was vital for multi-tasking operating systems.

The next processor in the range, the 80386DX, was a full 32-bit processor with 32-bit registers and 32-bit data and address lines. The original 16-bit registers of the 8086 and 80286 were still accessible as the lower half of each 32-bit register. This allowed backward compatibility. Physical and Virtual memory up to 4 Gbytes could now be addressed. Instructions using 32-bit operands and addresses were introduced. This was the first Intel processor to introduce a number of parallel stages into the processor pipeline. The pipeline was split into six stages which could all operate in parallel and the size of the pre-fetch queue was increased to 16 bytes. The main features are shown in Table 5.3.


Table 5.3: Main features of the 80386DX processor

Processor: 80386DX
Type: 32-bit
Year introduced: 1985
Data/Address bus width: 32/32 bits
Addressable memory: 4 Gbytes, by use of 32-bit addresses
System bus speed (MHz): 16-33
Internal bus speed (MHz): 16-33
Registers: 4 GP registers, 5 Pointer and Index Registers, 6 Segment Registers and 1 Flags Register. All 32 bits wide except the segment registers, which remain 16 bits wide.
Cache: None
Features: Increased addressable memory because of the 32-bit address bus. New instructions for 32-bit operands and 32-bit addressing modes. There was now no limit on the size of programs or data structures. Pipeline split into six stages, which could all operate in parallel, and a 16-byte pre-fetch queue was used.

The 80486DX, introduced in 1989, made several important advances. The pipeline was further split into smaller stages, so that each stage could carry out its task in 1 clock cycle. This allowed execution rates of 1 instruction per clock cycle. Also, to help feed the pipeline at a sufficiently high rate, an 8 Kbyte cache was included on the processor chip. In the 80486DX4 the cache was increased to give an 8 Kbyte data cache and an 8 Kbyte program cache.

Previous processors carried out floating-point operations either in software or by passing the operands to a separate floating-point co-processor. The 80486DX incorporated the floating-point processor and its registers onto the same chip as the CPU, which increased floating-point speeds considerably.

The important concept of "clock doubling" was introduced in the later 80486DX processors: the CPU runs internally at a faster speed than the external electronics. Doubling of the internal (to the CPU) clock speed was introduced in the 80486DX2. The 80486DX4 actually trebled the internal clock, despite its name! This is summarised in Table 5.4.


Table 5.4: Main features of the 80486DX processor

Processor: 80486DX
Type: 32-bit
Year introduced: 1989
Data/Address bus width: 32/32 bits
Addressable memory: 4 Gbytes, by use of 32-bit addresses
System bus speed (MHz): 25-50. In the 80486DX2 (1992) and the 80486DX4 (1994), 25-40.
Internal bus speed (MHz): 25-50. In the 80486DX2, clock doubled to 50-80; in the 80486DX4, clock trebled to 75-120.
Registers: 4 GP registers, 5 Pointer and Index Registers, 6 Segment Registers and 1 Flags Register. All 32 bits wide except the Segment Registers, which remain 16 bits wide. Eight 80-bit floating-point registers.
Cache: 8 Kbytes on chip, but 8 Kbytes data and 8 Kbytes program on the 80486DX4
Features: Further subdivision of the pipeline stages to achieve one instruction per clock cycle. An 8 Kbyte cache was added to the processor chip. Floating-point unit integrated onto the same chip as the CPU, with its own floating-point registers. Internal clock doubling and trebling introduced in the 80486DX2 and the 80486DX4 respectively.

5.3.3 The Pentium range

The Intel architecture went superscalar with the introduction of the Pentium processor (1993). The chip includes two integer arithmetic and logic units and a floating-point unit. This allows two instruction pipelines to execute concurrently, so two instructions can be executed per clock cycle. The floating-point unit can accept one floating-point instruction per clock cycle and has eight 80-bit registers connected by an 80-bit internal bus. Separate 8 Kbyte data and program caches were added on chip. The program cache is connected via a 256-bit internal bus, while other internal buses are 128 bits wide to speed up internal data transfers. The external data bus was also increased to 64 bits wide.


Table 5.5: Main features of the Pentium processor

Processor: Pentium
Type: 32-bit
Year introduced: 1993
Data/Address bus width: 64/32 bits
Addressable memory: 4 Gbytes, by use of 32-bit addresses
System bus speed (MHz): 66
Internal bus speed (MHz): 60-200
Registers: 4 GP registers, 5 Pointer and Index Registers, 6 Segment Registers and 1 Flags Register. All 32 bits wide except the segment registers, which remain 16 bits wide. Eight 80-bit floating-point registers.
Cache: 8 Kbyte data and 8 Kbyte program on chip
Features: Data bus increased to 64 bits wide. Superscalar, with two integer pipelines and one floating-point pipeline. Internal buses of 128-bit and 256-bit widths to speed internal data movements. Separate 8 Kbyte data and program caches.

The Pentium Pro (1995) incorporated more parallelism than the Pentium. It had three instruction-decode units feeding five parallel execution units (two integer, two floating-point and one memory interface unit). This allowed it to achieve execution rates of three instructions per clock cycle. A second-level (L2) cache of 256 Kbytes was added, closely coupled to the CPU by a 64-bit ‘backside’ bus. The address bus width was increased to 36 bits, giving a maximum physical address space of 64 Gbytes. This is summarised in Table 5.6.

Table 5.6: Main features of the Pentium Pro processor

Processor: Pentium Pro
Type: 32-bit
Year introduced: 1995
Data/Address bus width: 64/36 bits
Addressable memory: 64 Gbytes, by use of 36-bit addresses
System bus speed (MHz): 66
Internal bus speed (MHz): 150-200
Registers: 4 GP registers, 5 Pointer and Index Registers, 6 Segment Registers and 1 Flags Register. All 32 bits wide except the segment registers, which remain 16 bits wide. Eight 80-bit floating-point registers.
Cache: 8 Kbyte data and 8 Kbyte program on chip, and a closely coupled 256 Kbyte secondary cache
Features: Superscalar, with two integer units, two floating-point units and a memory interface unit. A secondary (L2) cache closely coupled to the CPU by a 64-bit bus. An expanded 36-bit address bus.


Since the Pentium Pro, the range has been further extended by the introduction of the Pentium II and subsequently the Pentium III, and the trends above have continued. Sophisticated techniques have been used for branch prediction, speculative execution and so on, to increase the degree of parallelism, and cache sizes have increased. In the Pentium II, new Multi Media eXtensions (MMX) instructions and hardware extensions were included to speed up the processing of video, audio and graphics data. In the Pentium III, new floating-point instructions were added to support three-dimensional graphics software. The 100 MHz system bus was also introduced with the Pentium II. Meanwhile, advances in chip fabrication technology have allowed internal clock speeds of over 1 GHz to be attained. MMX introduced the idea of a single instruction being applied to several items of data simultaneously, known as Single Instruction Multiple Data (SIMD). Many of the SIMD extensions were intended to improve floating-point performance, and in combination they have allowed up to four floating-point results to be returned in each cycle of the processor. The Pentium IV, introduced in 2000, has a longer pipeline (20 stages), ALUs that run at twice the processor clock speed, and both integer and floating-point extensions to the SIMD instructions. Its system bus speed is 400 MHz, compared with the previous Intel maximum of 133 MHz.
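The SIMD idea can be shown in miniature. The sketch below contrasts a scalar loop, which performs one multiply per step, with a packed operation that conceptually applies the same multiply to every element at once; plain Python lists stand in for packed 128-bit registers, so this illustrates the programming model rather than the hardware speed-up.

```python
# The SIMD idea in miniature: one operation applied to several items of
# data at once. Plain Python lists stand in for packed 128-bit registers,
# so this illustrates the programming model, not the hardware speed-up.

def scalar_scale(values, factor):
    out = []
    for v in values:                 # one multiply per loop iteration
        out.append(v * factor)
    return out

def packed_scale(values, factor):
    # Conceptually a single instruction acting on a whole packed register.
    return [v * factor for v in values]

pixels = [10.0, 20.0, 30.0, 40.0]
assert scalar_scale(pixels, 0.5) == packed_scale(pixels, 0.5) == [5.0, 10.0, 15.0, 20.0]
```

Scaling the brightness of four pixels in one packed multiply is precisely the kind of multimedia workload MMX and its successors were designed for.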

5.3.4 X86 series evolution

The table below summarises the development of the Intel X86 series from the 8088 through to the Pentium III, showing how the processors developed in evolutionary stages, with

• increasing register and data bus width

• increasing addressing capability (as a result of address bus width)

• increasing clock speed

Table 5.7:

Processor | Register width | Addressing capability | Address space | Data bus width | Max. clock speed
8088 | 16 bits | 20 bits | 1 MB | 8 bits | 10 MHz
8086 | 16 bits | 20 bits | 1 MB | 16 bits | 10 MHz
80286 | 16 bits | 24 bits | 16 MB | 16 bits | 16 MHz
80386SX | 32 bits | 24 bits | 16 MB | 16 bits | 25 MHz
80386DX | 32 bits | 32 bits | 4 GB | 32 bits | 40 MHz
80486SX | 32 bits | 32 bits | 4 GB | 32 bits | 33 MHz
80486DX | 32 bits | 32 bits | 4 GB | 32 bits | 100 MHz
Pentium | 32 bits | 32 bits | 4 GB | 64 bits | 166 MHz
Pentium Pro | 32 bits | 36 bits | 64 GB | 64 bits | 200 MHz
Pentium II | 32 bits | 36 bits | 64 GB | 64 bits | 600 MHz
Pentium III | 32 bits | 36 bits | 64 GB | 64 bits | 733 MHz

At the same time, the register structure and instruction set also developed. However, the original register structure and instruction set were retained as a subset, to ensure backward compatibility of software.


These diagrams show the development of registers within the X86 processors.

Figure 5.1: 8086: 8 x 16-bit registers (Accumulator AX, Base BX, Count CX, Data DX, Base Pointer BP, Stack Pointer SP, Source Index SI and Destination Index DI; AX to DX each addressable as high and low bytes, e.g. AH/AL)

Figure 5.2: 80386 and later X86 CPUs: the same 8 registers, widened to 32 bits (EAX, EBX, ECX, EDX, EBP, ESP, ESI, EDI), with the original 16-bit registers still accessible as their lower halves


Figure 5.3: 80486: a further set of 8 x 80-bit floating-point registers added

Figure 5.4: Pentium III: a further set of 8 x 128-bit registers added for SIMD instructions, with all earlier registers still maintained for compatibility

5.3.5 X86 Addressing modes and instruction set

The X86 series has a large instruction set. In total, there are 235 different op-codes, ranging in length from 1 byte to 15 bytes. When we studied the 6502, we met the idea of addressing modes: the same instruction can be used in different ways, depending on whether the operand is data (immediate addressing) or an address (direct addressing). The X86 series has a wide range of addressing modes (11 in total), including:

• Register mode

• Immediate mode

• Direct addressing mode

• Register deferred addressing mode

• Base addressing mode

• Indexed addressing mode

• Base indexed addressing mode
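A few of these modes can be illustrated with a short sketch. The register names, memory contents and mode labels below are simplified illustrations, not the actual X86 encodings.

```python
# A sketch of how a few addressing modes determine where an operand
# comes from. Register names and memory contents are invented for
# illustration; real X86 encodings are far more involved.

registers = {"EAX": 7, "EBX": 0x2000, "ESI": 4}
memory = {0x2000: 99, 0x2004: 123}

def operand(mode, **arg):
    if mode == "immediate":         # the operand is held in the instruction
        return arg["value"]
    if mode == "register":          # the operand is held in a register
        return registers[arg["reg"]]
    if mode == "direct":            # the instruction holds a memory address
        return memory[arg["addr"]]
    if mode == "base_indexed":      # address = base register + index register
        return memory[registers[arg["base"]] + registers[arg["index"]]]
    raise ValueError(mode)

assert operand("immediate", value=42) == 42
assert operand("register", reg="EAX") == 7
assert operand("direct", addr=0x2000) == 99
assert operand("base_indexed", base="EBX", index="ESI") == 123
```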

You don’t need to remember or understand these modes. The X86 series of microprocessors can be characterised as having:

• a relatively small number of registers (8 GP, 8 FP and 8 SIMD)

• a large instruction set

• instructions of varying length

• many addressing modes

These characteristics are typical of a CISC (complex instruction set computer) architecture. Other CISC-based processors include the IBM 370 and the VAX-11/780.

5.3.6 Review Questions on X86 series

Q5: Sketch a graph of the increase in clock speeds from the 8086 to the Pentium processor.

Q6: Which of the X86 processors was the first to use pipelining to improve performance?

Q7: How many registers has the (a) 8086, (b) 80286, (c) 80486, (d) Pentium?

Q8: Which X86 chip was the first to have a superscalar architecture?

Q9: The X86 series are considered to be CISC processors. Justify this claim.

5.4 The PowerPC series

The Intel X86 series were the processors on which the so-called IBM-compatible desktop computers were based. Originally, these ran under the MS-DOS operating system. As the processors improved, graphical user interfaces became possible, and the Windows operating systems replaced DOS. The first versions of Windows were little more than a rough graphical interface to DOS. As processor capability and the operating system improved, the 'Wintel' PC was born, which still dominates the laptop and desktop market.

At the same time, Motorola was developing its own family of microprocessors, the 68000 series. Unlike the X86 series, these were designed as 32-bit processors from the start. As a result, Apple was able to develop its Macintosh computers with a true graphical operating system from the outset.

The first member of the family was the Motorola 68000 (1979). This was an 8 MHz processor with a 32-bit internal architecture, served by a 16-bit data bus and a 24-bit address bus. The processor had sixteen 32-bit registers, split into eight data registers and eight address registers. All addresses were stored as 32 bits, but the upper 8 bits were ignored in the 68000. Programming was much easier, since direct addressing could be used to access up to 2^24 bytes, or 16 Mbytes, of memory directly, with no requirement for segment registers. Also, there was no 64 Kbyte restriction on the size of directly addressed data structures. As in the 8086, the 68000 could fetch the next instruction during instruction execution.

The next major successor to the 68000 was the 68020 (1984), which expanded the external address and data buses to 32 bits, so that the processor became a full 32-bit processor. A simple 3-stage pipeline was introduced and a 256-byte cache was added. In 1991 the 68040 added 4 Kbyte caches for data and program, expanded the pipeline to six stages and brought the floating-point unit onto the chip. The design was expanded to a superscalar version with the introduction of the 68060 (1994). This had three execution units, two integer and one floating-point. The pipeline was expanded to 10 stages, and the data and program caches became 8 Kbytes each.

The 68000 processor and its successors were used in Sun workstations, Apple Macintosh computers and later Atari computers. Eventually they fell out of use in the main computer market, but they are still used in various forms in the embedded systems market. Table 5.8 summarises some of the main characteristics of the family.

Table 5.8: Main characteristics of the Motorola 68000 family of processors

Processor   External bus speed   Data bus width   Address bus width   Address space   Cache
68000       8-20 MHz             16 bit           24 bit              16 Mbytes       none
68020       12-33 MHz            32 bit           32 bit              4 Gbytes        256 bytes
68040       25-40 MHz            32 bit           32 bit              4 Gbytes        2 x 4 Kbytes
68060       50-75 MHz            32 bit           32 bit              4 Gbytes        2 x 8 Kbytes
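The address-space figures quoted for the family follow directly from the number of usable address lines: each extra line doubles the addressable memory. A quick Python check (nothing here beyond the bus widths quoted above):

```python
def address_space_bytes(address_bus_bits: int) -> int:
    """Each address line doubles the amount of directly addressable memory."""
    return 2 ** address_bus_bits

# 68000: 24 usable address bits -> 16 Mbytes
print(address_space_bytes(24) // 2**20)   # 16
# 68020 onwards: a full 32-bit address bus -> 4 Gbytes
print(address_space_bytes(32) // 2**30)   # 4
```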

5.4.1 PowerPC background

In the final years of the 68000 processors, Apple, Motorola and IBM defined a specification for open system software and hardware, and Motorola and IBM designed the first PowerPC chip to meet this specification. PowerPC is an acronym for "performance optimised with enhanced RISC", and from the initial design, the PowerPC chips have reflected the RISC design features. Compared to the CISC-based X86 series, the PowerPC chips have:

• more registers

• a smaller, but more efficient, instruction set

• fewer addressing modes

The first PowerPC chip was the 601, released in 1993. It was a 32-bit chip (although it had a 64-bit data bus), with a clock speed of 60MHz, and able to address up to 4Gb of memory. It has a superscalar architecture, with 3 independent execution units - for


integer, floating point and branch processing - each with a 6 stage pipeline. It has 32K L1 (on chip) cache, and 512K L2 (off-chip) cache. Over the 2 years from 1994 to 1996, it formed the basis of the Apple PowerMac series of computers, with speeds increasing from the original 60MHz up to 120MHz. A block diagram of the PowerPC 601 shows clearly its superscalar architecture, see Figure 5.5.

Figure 5.5: Block diagram of the PowerPC 601. The front end (instruction fetch, branch unit, instruction queue and decode/dispatch) feeds an execution core of scalar arithmetic-logic units: floating-point unit, integer unit (ALU) and system unit.

You can read more about the PowerPC 601 at: http://www.cs.camosun.bc.ca/courses/comp182/barker/notes/ppc601.html

The PowerPC 601 was the first in a series of microprocessors, which have developed through the 1990s. The current version is the PowerPC 970, which is the basis of the Apple G5 computers.

5.4.2 PowerPC overview

Unlike the CISC-based X86 processors, which started with 8 registers, increasing to 24 in the Pentium chip, all PowerPC processors have 2 sets of 32 programmer-accessible general purpose registers, each 64 bits wide:


Figure 5.6: The PowerPC register sets

In addition, there are a small number of special purpose registers. The instruction set is similar in size to the X86 instruction set: the PowerPC has 225 instructions compared to the X86's 235. However, all the PowerPC instructions are exactly 4 bytes long. This means they can be executed efficiently, and are ideally suited for pipelining. Another contrast is in terms of addressing modes; the X86 has 11 addressing modes, while the PowerPC has only 2. The PowerPC uses direct addressing with load, store and branch instructions. All other instructions can only address data that is in an internal register. A comparison table highlights these differences:

                      X86 series      PowerPC series
GP registers          8 to 24         64
Instruction set       235             225
Instruction length    1 to 11 bytes   4 bytes
Addressing modes      11              2

The X86 is a typical CISC-based microprocessor; the PowerPC is a typical RISC-based microprocessor.
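The fixed instruction length deserves a closer look. The sketch below is purely illustrative (it decodes made-up byte streams, not real X86 or PowerPC machine code), but it shows why fixed-length instructions suit pipelining: every instruction boundary is known in advance, whereas a variable-length stream must be decoded sequentially, one instruction at a time.

```python
# Illustrative only - not real X86 or PowerPC decoding.

def split_fixed(stream: bytes, size: int = 4) -> list:
    """PowerPC-style: every instruction is exactly `size` bytes, so the
    position of any instruction is known without looking at the others."""
    return [stream[i:i + size] for i in range(0, len(stream), size)]

def split_variable(stream: bytes, length_of) -> list:
    """X86-style: each instruction's length depends on its leading byte,
    so decoding is inherently sequential."""
    out, i = [], 0
    while i < len(stream):
        n = length_of(stream[i])   # length only known after inspecting the opcode
        out.append(stream[i:i + n])
        i += n
    return out
```

Here `length_of` stands in for the opcode tables a real decoder would use.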


5.4.3 PowerPC series evolution

Just like the Intel X86 series, the PowerPC series of chips has evolved continuously over the last 10 years. This development is outlined in Table 5.9.

Table 5.9: Evolution of the PowerPC series (chip and year; clock speed; bus speed; transistors; on-chip cache; new features)

601 (1994): 50-120 MHz; 25-60 MHz; 2.8 million; 32K L1; 32 bit architecture, 3-way superscalar
603e (1995): 75-200 MHz; 33-50 MHz; 1.6 million; 16K + 16K L1; 5-way superscalar, separate instruction and data caches
604e (1996): 166-233 MHz; 33-50 MHz; 5.1 million; 32K + 32K L1; faster version of 603e, with larger caches
740/750 / G3 (1997): 233-800 MHz; 100-200 MHz; 21.5 million; 32K + 32K L1, 256K L2; high speed "backside" on-chip L2 cache
74xx series / G4 (1999): 350 MHz - 1.5 GHz; 100-166 MHz; 10.5-33 million; 32K + 32K L1, 256K L2; 64 bit ALU, SIMD processing, 128-bit vector processing unit (AltiVec), multiprocessing support
970 / G5 (2003): 1.8-2.5 GHz; 333 MHz - 1.25 GHz; 58 million; 32K + 32K L1, 512K L2; 6-way superscalar architecture + AltiVec unit, 512K L2 cache on chip

The actual values are not as significant as the trends.


A brief scan of the table shows that

• clock speeds have increased by a factor of 50 in 10 years

• bus speeds have increased by a factor of 20

• the complexity (number of transistors) has increased by a factor of 20

• on-chip cache has increased

• new features have been added.
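Two of these growth factors can be checked against the end points of Table 5.9; the divisions below are the only thing added here.

```python
# End-point values taken from Table 5.9.
clock_601_mhz, clock_g5_mhz = 50, 2500           # 50 MHz -> 2.5 GHz
transistors_601_m, transistors_g5_m = 2.8, 58    # millions of transistors

print(clock_g5_mhz / clock_601_mhz)              # 50.0 - clock speed up ~50x
print(transistors_g5_m / transistors_601_m)      # ~20.7 - complexity up ~20x
```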

All of these factors have contributed to a massive growth in performance, which shows no sign of stopping. During the 1990s and into the early years of the 21st century, the X86 and PowerPC chip architectures evolved simultaneously. The RISC-based design gave the PowerPC-based (Apple) systems an early advantage, but the X86-based Wintel PCs caught up with massive clock speeds. An interesting controversy arose from an over-simplistic reading of processor specifications. Around the year 2000, Wintel PCs were boasting clock speeds of well over 1 GHz, while the Apple Macs of the time were typically well below 1 GHz. This led to a myth that the Macs were inferior. However, due to their more efficient RISC-based chips, the actual performance was very similar. Clock speed alone is a very poor indicator of computer system performance. PowerPC chips are not limited to Apple computers. The Power 4+ chip, for example, is used widely in IBM servers. Finally, it should be noted that a number of variants of the PowerPC architecture are used in embedded systems, including:

• a custom PowerPC variant used in the Nintendo GameCube

• IBM 400 series, used in a variety of embedded applications

• Motorola MPC 8xx, 5xx, 82xx, 85xx, which are used in a range of automotive, industrial and telecommunications embedded applications

5.4.4 Review Questions on PowerPC series

Q10: Which 3 companies cooperated in the design of the PowerPC specification?

Q11: What was the first PowerPC chip released, and when?

Q12: The 601 chip can be described as superscalar. How is this justified?

Q13: How many programmer accessible registers are there in all PowerPC chips?

Q14: Compare the X86 and PowerPC architectures in terms of

a) instruction set b) instruction length c) addressing modes


Q15: What new feature did the G3 chip have which improved performance?

Q16: Which was the first PowerPC chip to have SIMD instructions?

a) 601 b) 604e c) G3 d) G4 e) G5

Q17: Why is clock speed not a good way of comparing a Windows PC with an Apple Macintosh?

Q18: Other than in Apple computers, what are PowerPC chips used for?

5.5 The Intel IA-64

5.5.1 IA-64 background

The X86 series reached its peak with the Pentium 3, Pentium 4 and Athlon processors. These are essentially CISC processors, using pipelining and superscalar processing, but with some RISC-like features. In 1994, Intel and HP began work on designing a new 64-bit architecture to replace the X86 series. The new IA-64 design uses a combination of RISC and CISC features, and is given the description EPIC - explicitly parallel instruction computing. There are 4 key features to the design:

• instruction level parallelism - the compiler creates code which uses the many parallel execution units of the processor

• use of VLIW - very long instruction words

• use of predication - executing both branches of a program, then discarding the "not chosen" branch results

• use of speculative loading - use of large fast cache to load data and instructions in advance of when they will be required
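Predication is easiest to see in miniature. The toy function below is an analogy only, not real IA-64 semantics: both arms of a branch are evaluated, and the predicate then selects which result survives while the other is discarded.

```python
def predicated_select(predicate: bool, then_branch, else_branch):
    """Toy predication: evaluate BOTH branches, keep only the selected result."""
    then_result = then_branch()   # work done speculatively...
    else_result = else_branch()   # ...on both paths
    # one of the two results is now simply thrown away
    return then_result if predicate else else_result

print(predicated_select(3 > 2, lambda: "taken", lambda: "not taken"))  # taken
```

The pay-off on real hardware is that the pipeline never stalls waiting to learn which way the branch went.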

Intel and HP predict that the IA-64 architecture will have a 25 year lifespan, evolving and improving as did the X86 series.


The two approaches contrast as follows:

• Instructions: the X86 uses complex, variable length instructions processed one at a time; the IA-64 uses simpler, fixed-length instructions bundled together in groups of three.

• Scheduling: the X86 reorders and optimises the instruction stream at run time; the IA-64 reorders and optimises the instruction stream at compile time.

• Branches: the X86 tries to predict which way branches will fork, and speculatively executes instructions along the predicted path; the IA-64, whenever practical, speculatively executes instructions along both paths of a branch and then discards the result it doesn't need.

• Data loading: the X86 loads data from memory only when needed, and tries to find the data in the caches first; the IA-64 speculatively loads data before it's needed, and still tries to find the data in the caches first.

5.5.2 VLIW

The IA-64 architecture makes use of VLIW - very long instruction words. Instructions are fetched from memory in a bundle 128 bits long. This bundle contains 3 instructions, each 41 bits long. The final 5 bits are a pointer, which tells the processor which of the many execution units each instruction should be assigned to.


The 128-bit bundle is laid out like this:

Bits 127-87 : Instruction 3 (41 bits)
Bits 86-46  : Instruction 2 (41 bits)
Bits 45-5   : Instruction 1 (41 bits)
Bits 4-0    : Pointer (5 bits)

The IA-64 architecture has 4 types of execution units:

• I-unit (integer and logical operations)

• M-unit (load and store operations)

• B-unit (branch operations)

• F-unit (floating point operations)

The 5 bit pointer (or "template") allows 32 different combinations of destination units for the 3 instructions. For example, a pointer 00000 means "send instruction 1 to the M- unit, send instruction 2 to the I-unit, and send instruction 3 to another I-unit". A pointer 11101 means "send instruction 1 to the M-unit, send instruction 2 to the F-unit, and send instruction 3 to the B-unit". The pointer is created by the compiler, which determines in advance whether or not instructions can be executed in parallel. When the instruction bundle arrives at the processor, the 3 instructions are directed to the appropriate execution unit for processing:

Compiler → Processor: instruction scheduler → functional units
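The bundle layout and template dispatch can be mimicked with shifts and masks. The field positions below are the ones given in the text; the routing table contains only the two template values quoted above (the other 30 entries are omitted), and the unit names are shorthand, not real IA-64 mnemonics.

```python
MASK_41 = (1 << 41) - 1   # mask for one 41-bit instruction slot

def decode_bundle(bundle: int):
    """Split a 128-bit bundle into its 5-bit template and three 41-bit slots."""
    template = bundle & 0b11111           # bits 4-0
    instr1 = (bundle >> 5) & MASK_41      # bits 45-5
    instr2 = (bundle >> 46) & MASK_41     # bits 86-46
    instr3 = (bundle >> 87) & MASK_41     # bits 127-87
    return template, (instr1, instr2, instr3)

# The two template encodings quoted in the text (of 32 possible)
ROUTES = {0b00000: ("M-unit", "I-unit", "I-unit"),
          0b11101: ("M-unit", "F-unit", "B-unit")}

bundle = (3 << 87) | (2 << 46) | (1 << 5) | 0b11101
template, instructions = decode_bundle(bundle)
print(instructions, ROUTES[template])  # (1, 2, 3) ('M-unit', 'F-unit', 'B-unit')
```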

The use of VLIW reduces the number of relatively slow memory fetches, and so enhances performance. Performance is also enhanced because the sequencing of instructions is predetermined by the compiler, rather than being dealt with at run-time.


The first commercial version of the IA-64 architecture is the Itanium chip.

5.5.3 The Itanium Processor

The Itanium processor is the first incarnation of the IA-64 architecture. It has 11 execution units:

• 4 integer units

• 2 floating point units

• 3 branch units

• 2 load/store units

making it massively superscalar. The units each have a 10 stage pipeline. The Itanium makes extensive use of:

• predication

• speculative loading of both data and instructions

As a result of all of these features, it is able to execute 20 operations per cycle, so the clock speed of around 800 MHz results in a performance equivalent to an X86 or PowerPC chip running at several GHz. The Itanium has

• 128 64-bit registers for integer/logical/general purpose use

• 128 82-bit registers for floating point and graphics use

The data bus is 128 bits wide (to carry the 128-bit VLIWs). The address bus is 64 bits wide, giving a theoretical maximum of 16 Exabytes of addressable memory! If you are not sure what an Exabyte is: 1 Exabyte = 1024 Petabytes = 1024 x 1024 Terabytes = 1024 x 1024 x 1024 Gigabytes. If you want to find out more about the Itanium chip, see http://www.sharkyextreme.com/hardware/guides/itanium/index.shtml
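Both headline figures here are simple arithmetic. The sketch below uses the 20 operations per cycle quoted above; note that 2^64 bytes works out at 16 binary Exabytes.

```python
# Peak throughput: clock speed x operations per cycle
clock_hz = 800e6          # an ~800 MHz Itanium
ops_per_cycle = 20        # figure quoted in the text
print(clock_hz * ops_per_cycle / 1e9)   # 16.0 - Gops/s at peak

# Address space of a 64-bit address bus, in binary Exabytes
exabyte = 1024 ** 6       # 1024 Petabytes
print(2 ** 64 // exabyte)               # 16
```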

5.5.4 Questions on the IA-64 Architecture

Q19: The IA-64 uses VLIW. What does this mean?

Q20: Can the Itanium be described as a superscalar architecture?

Q21: IA-64 chips use predication. Explain the difference between predication and branch prediction.

Q22: How can an 800MHz Itanium outperform a 2.5GHz Pentium?


5.6 Parallel Computing

5.6.1 Introduction

As we have seen, one of the methods of improving processor performance has been to introduce forms of parallel processing. Put simply, if a processor can do two (or more) things at the same time, its performance will be better. Examples of this we have seen so far include:

• Pipelining (executing one instruction while fetching the next)

• Superscalar architecture (multiple execution units all processing different operations simultaneously)

• SIMD instructions (the same instruction being applied to several data items at the same time)

All of these apply to single processors. Another approach is to use several processors working together. This is the basis of most mainframe computers and supercomputers.
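The idea of several processors cooperating on one task can be sketched with Python's standard multiprocessing module. This is a toy illustration of dividing one job across worker processes, not a model of how a real mainframe shares work:

```python
from multiprocessing import Pool

def partial_sum(chunk):
    """The piece of work given to each worker."""
    return sum(chunk)

def parallel_sum(data, workers=4):
    """Split `data` into chunks and sum them in parallel worker processes."""
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(workers) as pool:            # one OS process per worker
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    print(parallel_sum(list(range(1000)))) # same answer as a sequential sum
```

On a multiprocessor machine the chunks really are summed at the same time; on a single processor the same code still works, but the work is interleaved rather than parallel.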

5.6.2 Multiprocessing, mainframes and supercomputers

The simplest arrangement allowing multiprocessing is to connect several processors to the same system bus, as shown in Figure 5.7.


Figure 5.7: A simple multiprocessor arrangement - several processors share the system bus, which also connects the memory and, through a host/PCI bridge, the PCI bus with its I/O controllers and disks.

Each processor has shared access to memory (which may use interleaving) and to I/O devices. In some systems, one of the processors controls all the others; this is known as a "master-slave" system. In other systems, all processors are of equal status; this is known as "symmetrical multiprocessing" or SMP. SMP is the system on which most mainframe computers are based. Under this system, up to 10 processors can operate in parallel. Multiprocessing is not limited to mainframe systems. The PowerMac G5 dual processor desktop system has two G5 processors, producing massive performance gains when compared to other single processor computers.


Note that these tests are based on the application of a Photoshop filter to a large graphics file - an application ideally suited to parallel processing. The most powerful computers in existence at any time are often called supercomputers. The meaning of the term changes as progress is made. The mainframe systems described above were the supercomputers of a few years ago, but now the term is used to describe massively parallel systems, with hundreds or even thousands of processors.

5.6.3 Massively Parallel Architectures

For extreme processing applications, massively parallel systems are being developed. They all have the following features:

• hundreds of interconnected processors

• each processor has access to its own local memory or cache

• all processors can access a main (global) memory by a system-wide bus

• the processors are pipelined - the results from one processor can become the input for another processor

• as well as the system bus, there may be local pathways connecting groups of processors into clusters, and other pathways connecting clusters

• communication in such a complex array of processors is managed using packet switching technology, as on a large network

There are many varied configurations and examples of these massively parallel computer systems. Lucidor, at the Center for Parallel Computers at the Royal Institute of Technology in Stockholm, is a good example. It consists of 90 interconnected nodes. Each node has two 900 MHz Itanium 2 processors accessing 16K of L1 cache, and 256K of L2 cache. Each node can access the system bus via a 128 port switch at a data transfer rate of 2 Gbits per second. In addition to the local memory, each node has shared access to


6 Gb of main memory. As a result, the system can achieve data processing rates of over 600 Gflops. You can read more about Lucidor at http://www.pdc.kth.se/compresc/machines/lucidor.html

The Hitachi SR2201 is another massively parallel computer system. Various configurations are possible, with anything from 8 up to 2048 processors. The processors (Hitachi RISC chips) are arranged in a 3-dimensional grid to maximise communication between them. As with Lucidor, speeds of up to 600 Gflops can be achieved. These systems are in use for a variety of applications, including structural and crash analysis, fluid dynamics research, quantum chemistry analysis and visualisation tools. All of these can make use of the parallel architecture, as they require high speed processing of large amounts of data. You can read more about the Hitachi SR2201 at http://www.hitachi.co.jp/Prod/comp/hpc/eng/sr1.html

The name Cray has been associated with supercomputers for many years. The Cray T3D is a current example, with 2048 nodes arranged in a 3-dimensional grid. Each node has 2 Alpha processors, with access to individual cache and 8 Mwords of memory. Cray claims that this system can process a trillion floating-point operations per second. You can read more about Cray's current range of supercomputers at http://www.cray.com/products/index.html

Blue Gene is a computer architecture project designed to produce several next generation supercomputers, operating in the PFLOPS (PetaFlops) range. It is a cooperative project between the United States Department of Energy (who are funding the project), industry (IBM in particular), and academia. There are five Blue Gene projects in development - among them are Blue Gene/C, Blue Gene/L, and Blue Gene/P. Blue Gene/L has 65,536 compute nodes, connected by 3 networks. At the time of writing, Blue Gene/L is the fastest computer in the world, achieving over 70 Tflops. Figure 5.8 illustrates the structure of Blue Gene.


Figure 5.8: The structure of Blue Gene/L, from compute chip to complete system:

Compute chip: 2 processors; 2.8/5.6 GF/s; 4 MiB eDRAM
Compute card: 2 chips (4 CPUs); 2 x (2.8/5.6) GF/s; 2 x 512 MiB DDR
Node card: 16 compute cards, 0-2 I/O cards; 32 nodes (64 CPUs); 90/180 GF/s; 16 GiB DDR
Cabinet: 2 midplanes; 1024 nodes (2048 CPUs); 2.9/5.7 TF/s; 512 GiB DDR; 15-20 kW
System: 64 cabinets; 65,536 nodes (131,072 CPUs); 180/360 TF/s; 32 TiB DDR; 1.2 MW; 2,500 sq. ft.; MTBF 6.16 days

• each compute chip has 2 processors

• each compute card has 2 chips (= 4 processors)

• each node card has 16 compute cards (= 64 processors)

• each cabinet has 32 node cards (= 2048 processors)

• the system has 64 cabinets (= 131,072 processors)
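Multiplying the hierarchy out confirms the total; the level names below are simply those used in the figure.

```python
processors_per_chip = 2
chips_per_card = 2
cards_per_node_card = 16
node_cards_per_cabinet = 32
cabinets_per_system = 64

total = (processors_per_chip * chips_per_card * cards_per_node_card
         * node_cards_per_cabinet * cabinets_per_system)
print(total)  # 131072
```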

You can read more about the Blue Gene project at

http://en.wikipedia.org/wiki/Blue_Gene


A Blue Gene computer is currently being commissioned at Edinburgh University.

Researching a Massively Parallel Computer

Look up the web page http://en.wikipedia.org/wiki/Earth_Simulator and find out the following information about Earth Simulator (allow 20 minutes):

• number of nodes

• number of processors at each node

• local memory at each node

• global memory

• processing power in TeraFlops

• applications

5.7 Summary

• the Intel X86 series are typical CISC-based microprocessors used in Windows PCs

• the PowerPC series are typical RISC-based microprocessors used in Apple computers

• the IA-64 is a new design of microprocessor using a wide range of techniques to maximise performance

• the evolution of these microprocessor architectures includes
  – increasing clock speeds
  – increasing data bus widths
  – the use of pipelining
  – the introduction of superscalar processing

• each of these techniques has led to an improvement in performance

• a typical RISC based chip has
  – a large number of registers
  – a small instruction set, with all instructions of the same length
  – a small number of addressing modes

• a typical CISC based chip has
  – fewer registers than a RISC chip
  – a larger instruction set, with varied length instructions
  – a large number of addressing modes


• parallel computers (mainframes and supercomputers) typically have
  – large numbers of interconnected processors
  – each processor has access to local memory and global memory
  – the output from one processor may be the input to another (a form of pipelining)
  – local pathways allow communication between clusters of processors
  – packet switching may be used

• parallel computers give enormous performance benefits, especially for large data processing applications, such as scientific research and modelling

5.8 Revision Questions

Q23: A processor has 8 general purpose registers, 8 floating point registers and 8 SIMD registers. It is likely to be

a) a 6502 processor b) an X86 series processor c) a PowerPC processor d) an IA-64 processor

Q24: A processor has 2 addressing modes, and all instructions are 4 bytes long. It is likely to be

a) a 6502 processor b) an X86 series processor c) a PowerPC processor d) an IA-64 processor

Q25: The IA-64 architecture fetches instructions from memory

a) one at a time b) in a 128 bit bundle containing 3 instructions and a pointer c) in a 64 bit bundle containing 3 instructions and a pointer d) in a 128 bit bundle containing 2 instructions and 2 pointers

Q26: 1 Exabyte

a) = 1024 x 1024 x 1024 Gigabytes b) = 1024 x 1024 x 1024 Megabytes c) = 1024 x 1024 Gigabytes d) = 1024 x 1024 Petabytes

Q27: SMP is

a) symmetric multiprocessing, a type of pipelining used in massively parallel supercomputers b) speculative multiple pipelining, a system to prevent pipelines stalling due to a conditional branch instruction


c) symmetric multiprocessing, a system where several processors of equal status cooperate in a parallel computer system d) a type of massively parallel supercomputer



Topic 6

Operating System Functions

Contents

6.1 Prior Knowledge and Revision
6.2 Introduction to Operating Systems
    6.2.1 The operating system
6.3 Memory management
    6.3.1 Monoprogramming
    6.3.2 Multiprogramming
    6.3.3 Review Questions
6.4 Process management
    6.4.1 Round robin
    6.4.2 Priority scheduling
    6.4.3 First come, first served (FCFS)
    6.4.4 Shortest job first
    6.4.5 Multi-level Feedback Queues
    6.4.6 Review Questions
6.5 Input and output (I/O)
    6.5.1 Direct memory access
    6.5.2 Virtual devices
    6.5.3 Buffering
    6.5.4 Spooling
    6.5.5 Questions on input and output
6.6 File Management
    6.6.1 Logical View of files
    6.6.2 Physical View
6.7 Summary
6.8 Revision Questions

Learning Objectives

After studying this topic, you should be able to:

• identify the following main functions of an operating system
  – memory management
  – scheduling (process management)
  – managing input and output
  – file management

• describe the following techniques used in memory management
  – variable partitioning of memory
  – use of best fit, worst fit and first fit algorithms

• describe how these techniques relate to efficiency of memory allocation and access speed

• describe the need for scheduling in a multitasking system

• describe and compare round robin scheduling and multi-level feedback queues in terms of their effect on system performance

• describe the use of DMA to improve system performance

• describe the role of the OS in mapping between the logical view of files and their physical location

• describe contiguous and non-contiguous methods of file allocation


6.1 Prior Knowledge and Revision

Before studying this topic, you should be able to

• distinguish clearly between the roles of the operating system and application software

• name some common operating systems

• describe simply the main functions of an operating system
  – interpreting user commands
  – file management
  – memory management
  – input/output management
  – resource allocation
  – managing processes

Revision

Q1: Which of the following is not an operating system?
a) MS Windows XP
b) MS Office XP
c) MacOS X.3
d) Linux

Q2: One task of an operating system is to keep track of where saved documents are stored on external devices such as hard discs. This is an example of:
a) memory management
b) process management
c) file management
d) input/output management

Q3: An example of memory management is:
a) displaying the desktop on a monitor
b) checking that there is space to store a file on a disc
c) ensuring that a program does not overwrite the OS when loaded into RAM
d) allocating processor time to multiple programs

Q4: A device driver is part of the
a) user interface
b) file management system
c) memory management system
d) input/output management system


6.2 Introduction to Operating Systems

Hardware

A computer system is made of hardware. All the things in it that you can (or could) actually touch, that actually have weight, are hardware. The processor, memory, buses, disk drives, the disks themselves, the input devices and the output devices are all hardware. The hardware might be very costly but, on its own, it would do nothing for you. It is designed to support software. It is designed to be driven by software.

Software

Software is programs. It could be said that there are two kinds of software: system software, which helps you run the hardware, and application software, which helps you do the things that you wanted the computer for in the first place. Applications communicate with the system software. The system software communicates with the hardware. A simple diagram of the arrangement is shown in Figure 6.1.

Application Software

System Software

Hardware

Figure 6.1: Relationship between software and hardware

Application software

Application software helps you use the computer to solve problems. Examples of application software are word processors and spreadsheets. Such packages are called application software because they help you apply a general-purpose computer to a particular use. It could be said that an application package is designed to help you do things that would need to be done even if computers didn't exist: writing reports and letters, making up accounts, keeping records, and so on. An application package, such as a word processor, could not run on the bare hardware, and is not designed to do so. The hardware is too complicated and varied. An application package is designed rather to run under a certain operating system. The operating system provides a set of system calls that can be used by the programmer of an application to work the computer system. The operating system can also provide a library of subroutines, often called an Application Programming Interface (API), that


application software can use.

System software

System software is designed to enable you to run a computer without having to know exactly what's going on inside. System software controls the actual operation of the computer system. You can enter instructions to the computer by typing at the keyboard or clicking on a mouse, and the system software will convert these instructions into the thousands of low-level operations needed for the computer to carry them out appropriately. The operating system is part of the system software.

6.2.1 The operating system

The operating system is part of the system software. It can be regarded as an interface in three different ways. First, it is an interface between the parts of a computer system. These work at different speeds and in different ways, and the operating system enables them to communicate. For example, the processor and memory work at different speeds, and both are much faster than any printer; when you send something to be printed, the operating system ensures that the processor doesn't flood the printer with data. Second, the operating system is an interface between an application package and hardware. Software houses that create application packages write them to run under a particular operating system. Third, the operating system is an interface between the hardware and the user. Computers are much too complicated for most of us to operate directly. For a start, they work in binary, whereas we use letters and decimal numbers. People these days want computer systems to save things on disc, print things out, display moving pictures, play music, and all sorts. The operating system enables us to use all the resources that the computer system contains.

Interfaces for the user

There are two kinds of user interface: command-line interfaces (CLIs) and graphical user interfaces (GUIs). When working with a command line interface, you need to know the commands and type them in accurately. When using a GUI, you do most of your work with a mouse; you recognise names or icons, and use the mouse to select the ones you want. Superficially, a GUI is easier to use than a CLI. Whether, in the end, it is better is a matter of debate, and depends, among other things, on the disposition of the user and what the user wants the operating system to do. MS-DOS had a command line interface. Apple Macs from the start offered a GUI. Windows was developed in the first place as a GUI that could be bolted onto the front of DOS to make it seem more user-friendly.
Windows 3.1, for example, was not an operating system but merely an interface. Windows 95 was an operating system with a GUI, and all later versions of Windows have been operating systems as well.


What counts as part of an operating system

An operating system basically consists of system files that do its actual work (interpreting commands, loading files, and so on), but it usually comes with a collection of utilities such as text editors, file managers, etc. Operating systems often also provide language processors such as compilers, so that users can write programs, and sometimes a library of subroutines for use in those programs. People spend a fair bit of time quibbling about whether such things are part of an operating system.

Features of operating systems

Operating systems are expected to have similar features. These are:

• memory management;

• processing modes;

• input/output;

• filing system;

• resource sharing;

• user interface;

• applications support;

• security.

An operating system is meant to provide these features, but it is not on these alone that a system is judged. Other criteria include:

• cost;

• ease of installation;

• ease of management (how easy it is to add new devices, how much time you have to spend working at the operating system rather than at the applications);

• reliability (how often it hangs);

• performance (how fast it works, how much throughput it can handle);

• range of platforms (how many different kinds of computer it can run on);

• range of software (what applications are available for it);

• customisation (how easily and how much the system can be adapted to particular uses and users).

What a process is

The term process is basic to the understanding of the workings of an operating system. A programmer writes programs. The source code of these programs will be in assembly language or in some high level language like Java. Before a program can run on a computer, it has to be converted into machine code. When a program is actually running,


it contains run-time data and a stack; other information necessary to keep the program running includes a program counter and a stack pointer. All of these things together make up a process. A process can be defined as a program in execution. When you run a program, the operating system has to convert it into a process. It is the process, rather than the simple program, that the processor executes. When the process terminates, the program itself remains (on backing store), fit to be converted into a process again if needed. Another name for process is task.
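The ingredients of a process listed here can be pictured as a structure. The sketch below is a toy process control block; every field name is illustrative rather than taken from any real operating system.

```python
from dataclasses import dataclass, field

@dataclass
class Process:
    """Toy process control block: a program in execution."""
    program: str                   # which machine-code image is running
    program_counter: int = 0       # address of the next instruction
    stack_pointer: int = 0         # top of the run-time stack
    stack: list = field(default_factory=list)   # run-time stack contents
    data: dict = field(default_factory=dict)    # run-time data

p = Process(program="editor")      # running a program creates a process
print(p.program, p.program_counter)
```

When the process terminates, only the `Process` object goes away; the program itself remains on backing store.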

Operating System Layers and Functions On the Web is an animation which shows how the layers of an operating system take part in the task of opening a file held on disk. You should now observe this animation. In the next few topics we will look in more detail at how the different layers of the OS work.

6.3 Memory management Memory management is to do with the way an operating system allocates memory to processes. An operating system needs room in memory for itself and for the processes it creates, and user programs need memory for themselves and the data they use. Part of the operating system, known as the memory manager, allocates memory to these processes and keeps track of the use of memory. Some operating systems, such as early versions of Windows (95, 98, NT), provide only a single-user environment. Multi-user memory management is provided by most modern operating systems. Multi-this, that, and the other Most operating systems support multi-tasking. In multi-tasking, several applications or tasks are (apparently) available at the same time, and the user can switch easily between one application and another. Most operating systems also support multi-threading. Threads are sometimes called lightweight processes. One process can have several threads. The threads act independently of one another, but share the same memory space. For example, in a database application, one thread might run the user interface and another might handle the actual database. In a word processor, one thread might deal with taking in and displaying what the user enters and another might deal with printing. A spreadsheet might create a thread to carry out a lengthy set of calculations while other threads maintain the user interface. In a client-server database, queries might each be handled by a thread.
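The point that threads share one memory space can be illustrated with Python's threading module. This is a sketch only; the thread names ui and db are invented for the example:

```python
import threading

# Two threads of one process share the same memory: both append to the
# same list, something two separate processes could not do directly.
shared_results = []
lock = threading.Lock()

def worker(name, items):
    for item in items:
        with lock:  # shared data, so the threads must coordinate access
            shared_results.append((name, item))

ui = threading.Thread(target=worker, args=("ui", [1, 2]))
db = threading.Thread(target=worker, args=("db", [3, 4]))
ui.start(); db.start()
ui.join(); db.join()
print(len(shared_results))  # 4: all the appends landed in the one shared list
```

Two separate processes would each have had their own copy of the list; the threads see a single one.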


Process loading and swapping If a program is to be run (as a process), it needs to be copied into memory from backing store. This action - by no means instantaneous - is known as process loading. Some systems, as we shall see, also take processes out of memory and store them on disc, to make room for another process. This is known as process swapping.

6.3.1 Monoprogramming The simplest method of memory management is to have just one process in memory at a time, and to allow it to use all the memory there is. One obvious problem with this approach is that a program needs to be entirely self-reliant and contain, for example, drivers for all I/O devices it needs to use. This method is no longer used. The technique that evolved for simple PCs was to have the operating system and one user program at a time in memory, as well as a Basic Input Output System (BIOS) in ROM. The BIOS contained the device drivers. This technique can still be called monoprogramming because there is only one user program in memory at a time. DOS basically works in this way. A sketch of memory might look like Figure 6.2

   BIOS (in ROM)
   ------------------------
   User Program
   ------------------------
   Operating System in RAM

Figure 6.2: Memory in a monoprogramming system

6.3.2 Multiprogramming Monoprogramming would not do on a large computer that has lots of users. If a computer is to provide a service to many users at the same time, it has to be able to hold many programs in memory and run them in turn. Having several programs in memory at once is called multiprogramming. Another reason for using multiprogramming (even with a single user) is the urge to get the most out of the processor. Most processes spend much of their time waiting for I/O of some kind or another. If I/O is going on, CPU time is being wasted. It makes more efficient use of the processor to hand it over to another process if the current process is simply waiting for some data. In multiprogramming, memory is divided into parts, of one kind or another. Names for these parts include partitions, segments, and pages. The operating system has to deal with two problems in all forms of multiprogramming. It has to keep the programs separate from each other and to know where all the parts of


all the programs are. In addition it has to protect programs from each other, so that data from one process doesn’t overwrite that of another process.

Multitasking Animation On the Web is an animation that demonstrates how the processor is switched between tasks resident in memory. You should now observe this animation. (10 min)

6.3.2.1 Fixed partitions

Using fixed partitions is a relatively straightforward way of effecting multiprogramming. The partitions are fixed in size but don’t have to be all the same size. Figure 6.3 shows how a map of memory might look, with two 100k partitions (one for the operating system), and one partition each of 200k, 300k, and 400k.

1000k ----------------
      Partition 4: Process D
 700k ----------------
      Partition 3: Process C
 400k ----------------
      Partition 2: Process B
 200k ----------------
      Partition 1: Process A
 100k ----------------
      Operating System
   0k ----------------
(shaded areas in the original figure = unused space within partitions)

Figure 6.3: Fixed partitions

Drawbacks to fixed partitions The main problem with fixed partitions is that processes don’t usually fit partitions exactly. As a rule, there will be wasted space in each partition. A map of memory, as in the diagram above, looks fragmented. This waste of space is known as internal fragmentation (internal because the space wasted is inside the partitions). Another drawback is that a process might come along that is too big for any of the partitions. Another is that space wasted in internal fragmentation could, if brought all together, provide space for more processes. Fixed partitions work well enough in batch processing systems, which are on the whole predictable and can be regulated. In time-sharing systems with lots of users, there will often be more processes wanting service than can be fitted into memory. More


sophisticated systems of memory allocation are called for.

6.3.2.2 Variable partitions

One solution is to allocate to each process exactly the amount of memory it needs. When a process turns up, it is loaded into the next available area. This can go on until there is no room left. Figure 6.4 shows four processes in memory. Process A requires 75k of memory, process B requires 150k, process C needs 225k and process D is using 200k. There is 250k of free memory.

1000k ----------------
      Available Space
 750k ----------------
      Process D
 550k ----------------
      Process C
 325k ----------------
      Process B
 175k ----------------
      Process A
 100k ----------------
      Operating System
   0k ----------------

Figure 6.4: Variable partitions

Process E, needing 150k, turns up and is allocated memory, leading to the partitions shown in Figure 6.5.


1000k ----------------
      Available Space
 900k ----------------
      Process E
 750k ----------------
      Process D
 550k ----------------
      Process C
 325k ----------------
      Process B
 175k ----------------
      Process A
 100k ----------------
      Operating System
   0k ----------------

Figure 6.5: The arrival of process E

Process F arrives, needing but not getting 300k. Process C terminates, which leaves memory looking like Figure 6.6.

1000k ----------------
      Available Space
 900k ----------------
      Process E
 750k ----------------
      Process D
 550k ----------------
      Available Space
 325k ----------------
      Process B
 175k ----------------
      Process A
 100k ----------------
      Operating System
   0k ----------------

Figure 6.6: Process C terminates

Note that there is, all told, 325k spare, which would be enough for the new process, but the spare memory is in two fragments, neither of which is big enough in itself. This kind of fragmentation is known as external fragmentation: the wasted space is external to


any process allocation. The spaces in memory are sometimes called holes. Then Process B terminates. The holes left by processes B and C coalesce, and process F can be fitted into the resulting hole (see Figure 6.7).

1000k ----------------
      Available Space
 900k ----------------
      Process E
 750k ----------------
      Process D
 550k ----------------
      Available Space
 475k ----------------
      Process F
 175k ----------------
      Process A
 100k ----------------
      Operating System
   0k ----------------

Figure 6.7: Process F can be executed once process B terminates

Fitting Once things are up and running, new processes need to be fitted into the holes available. There are three policies to choose among when allocating processes to space. These are summarised in Table 6.1.

Table 6.1: Policies for allocating processes to space

First fit    Incoming process is fitted into the first hole big enough.
Best fit     Incoming process is fitted into the smallest hole that can take it.
Worst fit    Incoming process is fitted into the largest hole going.

First fit is at least quick and simple. Best fit seems the obvious idea. The problem with it, though, is that it tends to leave small and therefore useless holes. Worst fit doesn’t seem a good idea, but it gets round the problem of best fit, because the hole left will often be big enough for another process.
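The three policies can be sketched as functions over a list of hole sizes in k; each returns the index of the chosen hole, or None if nothing fits. The hole sizes below are invented for illustration:

```python
# Three allocation policies over a list of hole sizes (in k).
def first_fit(holes, size):
    return next((i for i, h in enumerate(holes) if h >= size), None)

def best_fit(holes, size):
    fits = [(h, i) for i, h in enumerate(holes) if h >= size]
    return min(fits)[1] if fits else None   # smallest hole that can take it

def worst_fit(holes, size):
    fits = [(h, i) for i, h in enumerate(holes) if h >= size]
    return max(fits)[1] if fits else None   # largest hole going

holes = [100, 225, 75]          # illustrative hole sizes
print(first_fit(holes, 90))     # 0: first hole big enough
print(best_fit(holes, 90))      # 0: the 100k hole is the smallest that fits
print(worst_fit(holes, 90))     # 1: the 225k hole is the largest going
```

Note how best fit leaves a useless 10k sliver in the 100k hole, while worst fit leaves 135k of the 225k hole, still big enough for another small process.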


Information about holes is often kept in a linked list. Coalescing holes is straightforward. First fit will take the first hole large enough; the other two methods will need to scan the entire list before a decision can be made. Compaction Fragmentation can be kept down to one hole (of whatever size) by shifting processes around whenever one leaves memory. All remaining processes are packed down (compacted) so as to leave a single hole at the top. In the previous series of memory maps, process F arrived and process C terminated. Process F did not have enough room. However, with compaction, processes D and E could be moved down in memory, leaving enough room at the top for F. The resulting memory would look like Figure 6.8.

1000k ----------------
      Available Space
 975k ----------------
      Process F
 675k ----------------
      Process E
 525k ----------------
      Process D
 325k ----------------
      Process B
 175k ----------------
      Process A
 100k ----------------
      Operating System
   0k ----------------

Figure 6.8: Memory after compaction

All such gains in sophistication have a cost in terms of complexity of implementation. The more intricate the algorithm, the more intricate the code needed to make it happen. Furthermore, an action such as compaction is not instantaneous but takes time. The overheads that compaction calls for tend, in fact, to be so great and make things so slow that the scheme is rarely used in practice. Paging In the discussion so far, we have assumed that processes are relatively small and can fit into memory. However, from the very early days, there has been the problem that some programs are too large to fit into memory at all, never mind into a fraction of it such as a partition. In a paged system, processes are chopped up into a number of chunks all of the same size. These chunks are called pages. Typical page sizes are 4 kilobytes or 32 kilobytes. Memory is regarded as a set of page frames, each of a size to accommodate one page


exactly. One of the (logical) pages used by the processor can be assigned to any page frame. With respect to internal fragmentation, the most space a process can waste, then, is the spare capacity in its last page. One of the advantages of paging, therefore, is that it leads to a more efficient use of memory than partitioning schemes. Let us imagine five processes, which need pages as follows:

Process   Frames needed
A         5
B         3
C         2
D         6
E         4

Processes A, B, and C come in, and are given space in memory, which contains 16 page frames.

A1 A2 A3 A4 A5 B1 B2 B3 C1 C2 -- -- -- -- -- --   (6 frames free)

Process B terminates and processes D and then E come in. Note that the pages relating to a process don’t need to be contiguous (i.e. neighbouring). Process D takes up the page frames vacated by B and then takes further frames beyond C. However, E needs four frames and is too large for the remaining number of page frames, so it will have to wait, and the three spare page frames are, for the time being, wasted.


A1 A2 A3 A4 A5 D1 D2 D3 C1 C2 D4 D5 D6 -- -- --   (3 frames free)

With pages dotted about here and there, it’s obviously important to keep track of where everything is. Each address in memory can be regarded as having two components, a page number and a displacement (or offset) within the page. When a page from a process is placed in a page frame, the page number changes but the displacement stays the same. The operating system keeps track by maintaining a page table, the purpose of which is to map virtual pages onto page frames. It contains each possible process page number and its corresponding page-frame number, and works like a function that takes in a virtual page number and returns the actual frame number. Paging is most useful when the current processes are too large to fit into memory. We then need to use the idea of virtual memory. We shall be looking at virtual memory a little later.
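The function-like behaviour of a page table can be sketched directly. The 4 kilobyte page size follows the text; the frame numbers in the table below are invented for the example:

```python
PAGE_SIZE = 4 * 1024   # 4 kilobyte pages, as in the text

# A hypothetical page table for one five-page process:
# virtual page number -> page-frame number (frame numbers invented).
page_table = {0: 5, 1: 6, 2: 7, 3: 8, 4: 9}

def translate(virtual_address):
    """Split the address into page number and displacement, map the page
    to its frame, and carry the displacement over unchanged."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    frame = page_table[page]            # the mapping the OS maintains
    return frame * PAGE_SIZE + offset

# An address 100 bytes into virtual page 2 lands 100 bytes into frame 7.
print(translate(2 * PAGE_SIZE + 100))  # 28772, i.e. 7*4096 + 100
```

Real hardware performs this lookup on every memory access, which is why processors carry dedicated translation hardware rather than doing it in software.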

6.3.3 Review Questions Q5: Distinguish (using examples) between multi-user, multi-tasking and multi-threading.

Q6: What is meant by "process loading"?

Q7: Explain why there will be more than one program in memory even in a monoprogramming system such as DOS.

Q8: Explain the system of fixed partitions, including its purpose, and its two major drawbacks.

Q9: Explain how variable partitioning overcomes the drawbacks of fixed partitioning.

Q10: Explain why "best fit" is not necessarily the best method to use.

Q11: Explain what is meant by paging, and why it is useful.


Q12: Match the three fitting algorithms (First Fit, Best Fit, Worst Fit) to the correct description. Descriptions:

• New process is allocated to the smallest space into which it will fit

• New process is allocated to the largest space available

• New process is allocated to the first big enough space in memory

6.4 Process management Modern computer systems look as though they can do several different things at the same time. PCs contain only one processor, and it is the processor that really does things, so in fact, at any given moment, only one thing is ever being done. However, because it works very fast, the processor can do a little work on one thing, then a little on the next, and so on. This enables it to give the impression of doing lots of things at once. In discussing memory management, we looked at ways of making more than one process available in memory for the processor to work on. Process management is to do with choosing which process to run, and how long to run it for. This is managed by part of the operating system called the scheduler. A scheduler works according to a particular scheduling algorithm. The algorithm is subject to certain constraints:

• it should be fair, so that each process gets its fair share of the processor;

• it should be efficient, to maximise use of the processor;

• it should achieve high throughput;

• it should have a good response time.

Many processes are, obviously, started directly by the user. Others are started by the operating system. Processes can themselves start other processes. All these processes have a program counter, registers, and other information. The processor performs what’s called a context switch when it changes from one process to another: it saves all the data relevant to the first process and replaces them with the data to do with the second. Pre-emptive scheduling Before people thought of scheduling, a process, once it had gained control of the processor, would run to completion. Such an approach is still tenable on PCs that expect to have only one user at a time (DOS works in this way), but won’t do at all on a multi-user or multi-tasking system. One answer is cooperative scheduling where each process would, in principle, run for a little and get a bit of work done. Then it would nobly resign the processor to the


operating system which could do a bit of housekeeping or allow another process to have a go. What happens, of course, is that processes tend to act selfishly and keep the processor to themselves for as long as they can. The problem is that the operating system gives up too much control. Cooperative scheduling would work even worse in a multi-user system. Somebody trying, say, to break the world record for prime numbers might hog the processor almost indefinitely. The scheduler needs, then, to be able to suspend one process, at any point, and give another process a turn. The operating system keeps control. Such scheduling is known as pre-emptive scheduling. Process states A process enters the system and starts running. When it’s finished, it terminates. While it’s in the system, the process can be in one of three states:

• running;

• blocked;

• ready.

The process runs until obliged to wait for I/O or pre-empted by the scheduler (to give another process a turn). If it is pre-empted, it is still ready to use the processor. If it is waiting for I/O, it is described as blocked, and stays in that condition until the I/O is over. When the I/O is completed, the process passes into the ready state again. When its turn comes round again and it is ready, it is shifted back into the running state by the scheduler. This is shown in Figure 6.9.

process entry → Running → termination
Running → (I/O wait) → Blocked
Blocked → (I/O done) → Ready
Ready → (process selected) → Running
Running → (timeout) → Ready

Figure 6.9: Process state diagram

We now go on to look at different scheduling algorithms.
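Before moving on, the transitions of Figure 6.9 can be sketched as a small transition table. This is a sketch only; the event names are invented:

```python
# The state diagram as a transition table: (state, event) -> new state.
TRANSITIONS = {
    ("running", "io_wait"):  "blocked",
    ("running", "timeout"):  "ready",
    ("blocked", "io_done"):  "ready",
    ("ready",   "selected"): "running",
}

def step(state, event):
    return TRANSITIONS[(state, event)]

# A process is pre-empted, waits its turn, runs again, blocks on I/O,
# becomes ready when the I/O completes, and is selected once more.
s = "running"
for event in ["timeout", "selected", "io_wait", "io_done", "selected"]:
    s = step(s, event)
print(s)  # running
```

There is deliberately no ("blocked", "selected") entry: a blocked process cannot be given the processor until its I/O is done, exactly as the diagram shows.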


6.4.1 Round robin Processes are put in a sort of circular list. Each process in turn is assigned a time slice. If the process is still running at the end of the time slice, it is pre-empted and has to wait its turn again. Switching processes takes time, so too short a time slice wastes processor time on switching; too long a time slice might mean irritatingly long waits. Time slices are usually about a tenth of a second long. This algorithm does not require any expectations about the length of time a process will take to complete. It is therefore suitable for interactive systems. The algorithm is also fair, in that all jobs are treated as though they are equally important.

(Diagram: new processes join a circular queue of processes, each taking a turn on the processor.)
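A minimal round-robin simulation, with invented job lengths measured in whole time slices:

```python
from collections import deque

def round_robin(jobs, quantum):
    """jobs: {name: time still needed}. Returns the order of time slices."""
    queue = deque(jobs.items())
    trace = []
    while queue:
        name, remaining = queue.popleft()
        trace.append(name)              # this job gets the processor
        remaining -= quantum
        if remaining > 0:               # not finished: back of the queue
            queue.append((name, remaining))
    return trace

# Three jobs needing 3, 1 and 2 slices respectively, quantum of 1.
print(round_robin({"A": 3, "B": 1, "C": 2}, quantum=1))
# ['A', 'B', 'C', 'A', 'C', 'A']: each job waits its turn again after a slice
```

Note that B, the shortest job, finishes after one slice, while A keeps returning to the back of the queue until its three slices are used up.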

6.4.2 Priority scheduling Of course, not everyone wants things to be fair: what many people want is for their processes to be regarded as more important than other people’s. In priority scheduling, each process is given a priority, and the job with the highest priority is processed. In many systems, there is a queue of jobs for each level of priority, and the jobs within a certain level are treated to round-robin scheduling. Because low-priority jobs might, on a busy system, never get a turn, things are sometimes organised so that jobs that have been waiting a long time have their priority raised. (There are stories of large computers being de-commissioned and jobs being found on them that had been waiting for a little processor time for years.) Three algorithms used for priority scheduling are first come first served, shortest job first and multi-level feedback queues.

6.4.3 First come, first served (FCFS) This algorithm is straightforward, and works well enough for long jobs. Consider four jobs A, B, C, D which have run times of 20, 10, 4 and 2 minutes.


Job   Processing time   Turnaround time
A     20                20
B     10                30
C     4                 34
D     2                 36

This gives an average turnaround time of 30 minutes, which compares badly with an average run time of 9 minutes.

6.4.4 Shortest job first In batch systems, and other systems where the running time of a process can be predicted, we can use the principle that the shortest jobs should be run first. This would give us results as follows:

6.4.4 Shortest job first In batch systems, and other systems where the running time of a process can be predicted, we can use the principle that the shortest jobs should be run first. This would give us results as follows:

Job   Running time   Turnaround time
D     2              2
C     4              6
B     10             16
A     20             36

This gives an average turnaround time of 15 minutes: the longest job does worse but the three other jobs do a lot better. The shortest-job-first approach gives optimal turnaround times. On the other hand, long jobs on busy systems might have to wait indefinitely because shorter jobs keep getting in the way. There are methods available whereby interactive systems can estimate how long a process is going to need.
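The two tables above can be reproduced with a short calculation. This sketch assumes, as the text does, that all four jobs arrive together at time zero:

```python
def turnaround_times(run_times):
    """Run jobs in the given order; each job's turnaround time is the
    sum of all run times up to and including its own."""
    finished, times = 0, []
    for t in run_times:
        finished += t
        times.append(finished)
    return times

fcfs = turnaround_times([20, 10, 4, 2])   # A, B, C, D in arrival order
sjf  = turnaround_times([2, 4, 10, 20])   # shortest job first
print(sum(fcfs) / 4)  # 30.0, the FCFS average from the text
print(sum(sjf) / 4)   # 15.0, SJF halves the average turnaround
```

The same four run times give half the average turnaround simply by reordering, which is why SJF is attractive wherever run times can be predicted.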

6.4.5 Multi-level Feedback Queues Each of these methods has advantages and also disadvantages. The multi-level feedback queue algorithm tries to combine the best of each. The system maintains several process queues. In this example, we will assume there are 3 queues, but there may be more. The top level queue has highest priority, but each process in this queue is only allocated one very short time slice. Any new process is allocated to this queue. If it is a short process, it will be quickly dealt with. Many processes only require a short amount of processor time initially, then have to deal with I/O for a longer time independently of the processor (perhaps using DMA). These processes are also quickly dealt with. Any process in the top level queue is guaranteed to get some processor time very quickly. If it is a longer process, it is given a short time slice, then moved down to the second level queue, so that the top level queue is not held up. The second level queue gets processor time if the top level queue is empty. It works in a similar way to the top level queue, but with longer time slices (typically twice as long as the top level queue). If a process is still not completed within the longer time slice, it is pushed down to the next queue. The final queue is a round robin queue, which continues to allocate time slices to


processes (when the processor is not being used by the higher queues) until they are completed.

(Diagram: Level 1 queue, time slice q = 1, takes new processes; Level 2 queue, q = 2; down to a Level N round robin queue, the time slice doubling at each level. Each queue feeds the processor only when the queues above it are empty.)
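A simplified sketch of a three-level feedback queue. The doubling of the time slice follows the text; the particular slice lengths of 1, 2 and 4 units, and the job lengths, are invented:

```python
from collections import deque

def mlfq(jobs, quanta=(1, 2, 4)):
    """jobs: {name: total time needed}. A job that exhausts its slice
    drops a level; the bottom level is round robin. Returns completion order."""
    levels = [deque() for _ in quanta]
    for name, need in jobs.items():
        levels[0].append((name, need))          # new processes enter at the top
    done = []
    while any(levels):
        i = next(i for i, q in enumerate(levels) if q)  # highest non-empty level
        name, need = levels[i].popleft()
        need -= quanta[i]
        if need <= 0:
            done.append(name)
        else:                                   # not finished: drop a level
            levels[min(i + 1, len(levels) - 1)].append((name, need))
    return done

print(mlfq({"short": 1, "long": 7}))  # ['short', 'long']
```

The short job finishes in its first top-level slice; the long job sinks through the levels, picking up a longer slice each time, and finishes in the bottom round-robin queue without ever holding up the top queue.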

6.4.6 Review Questions Q13: What states can a process be in?

Q14: When a process is blocked, it means that

a) It’s held up by another process b) It’s waiting for some I/O to finish c) The scheduler’s gone wrong

Q15: When a running process is timed out, it becomes

a) Ready b) Terminated c) Blocked

Q16: What’s the difference between cooperative and pre-emptive scheduling?

Q17: A process can have many threads.

a) True b) False

Q18: Each thread of a process has its own space in memory.

a) True b) False


6.5 Input and output (I/O) Computers are, essentially, input/output machines. They take in data, process them, and put out the results. Many users of PCs aren’t much bothered about explicit processing; they don’t want the thing to solve differential equations for them; they want to be able to write things like reports, and they want to use e-mail and the Internet; they want to type things in, see things on the screen, and print things out. It follows that how the operating system handles input and output is now of the first importance. Users expect input and output to be easy and straightforward. Input/output is often abbreviated to I/O. The I/O system handles the interactions between the processor and all the devices that are part of the computer system as a whole. Devices regarded as part of the computer include the keyboard, the mouse, disk drives, modems and the screen. The I/O system also has to deal with obviously external devices such as printers and scanners. These devices are, in general, different from one another: they work in different ways and at different speeds. All these speeds, however, are slower than the computer’s processor and memory. One of the I/O system’s objectives is efficiency: making the most of the attached devices and getting them to work at their highest useful rates. In the I/O system, the software can be thought of as organised in layers: the lower one deals directly with the hardware and the upper one presents an interface to the user or application. (There are so many different kinds of device that no operating system could include the code to handle them all.) The upper level is the I/O control system, which takes commands from the user (either directly or through an application), checks them, and sends them to the appropriate part of the lower level. The lower level consists of device drivers. A device driver is a piece of software that handles one device or type of device.
For example, every printer on a computer system will need a device driver, but that device driver might work with more than one model of printer. The device driver converts commands originating from the user into instructions for the device to carry out. The device driver communicates with the device controller, which is part of the hardware, plugged into the computer’s I/O bus. The controller, in turn, communicates with the device itself, converting the electronic instructions received into physical action. Figure 6.10 shows how these levels are related. An application starts things off by making a system call that demands action from the I/O system.


Application program
    | (system call)
I/O control system  ]
Device driver       ]  operating system
    | (I/O bus)
Device controller   ]
Device              ]  hardware

Figure 6.10: Levels of an operating system

Device independence One of the main ideas involved in the design of an I/O system is device independence. What this means is that the user should not need detailed knowledge of the devices in the system or of how they work. For example, the system should make it as easy for the user to handle data from a floppy disc as from a hard disc or a CD-ROM, even though these three kinds of disc are made of different materials and are run by different technologies and mechanisms. The I/O system hides the physical details of a device from the users and presents them with straightforward means of access and control. Input/Output in Windows Windows 98 makes use of Dynamic Link Libraries (DLLs). A DLL is a piece of executable code (a device driver, for example) that can be shared by different processes. The routines in a DLL are only linked into an application at run time. In Windows, all the device drivers are dynamic link libraries. There’s a DLL for the keyboard and a DLL for the display, and so on. This saves memory: most applications need to use these drivers and can share them (rather than each application having its own set of device drivers). Furthermore, if the computer’s on a network, with a range of printers available, any application can use any printer driver. The driver for a new device can simply be added to the library of DLLs. Plug and play Adding a new device to a system has in the past been rather a fiddly business. Often a new control card would have to be plugged into the computer. Very often, switches on the device would have to be set by hand. Then the computer system would often need a lot of tweaking before the operating system and the device could communicate in a satisfactory way. The concept of plug and play tries to make the business more automatic.


Plug and play can be regarded as a sort of convention or standard agreed between the companies responsible for the different parts of a computer system: computer manufacturers, chip manufacturers, device manufacturers, and so on. Things are designed to suit the plug and play standard; devices are designed so that they can be set by software (rather than having switches set by hand). Most modern operating systems automatically detect all plug and play devices on the system. The user has to supply very little information during installation and nothing at all from then on.

6.5.1 Direct memory access As a rule, a device controller will have its own memory on board. This enables it to use Direct Memory Access (DMA) to get data to and from the computer’s memory. In DMA, the processor only sets up the transfer. Thereafter, some clock cycles are used by the I/O system for the data transfer. During these cycles, the processor is doing nothing but, during the other cycles, the processor can be getting on with its work: it doesn’t have to suspend the current process or spend time on context switches. On the whole, DMA saves time by saving the processor for other things than supervising data transfer.

6.5.2 Virtual devices A virtual device seems, to the user, like an actual device but is, in fact, a simulation maintained by the I/O system. A print spooler is a virtual device that takes data sent to the printer, stores them temporarily on disc, and prints them when the system is ready to do so. This means that several print jobs can be sent to the printer at the same time and will be dealt with in an orderly manner. The jobs are placed in a print queue and serviced one by one. It also means that a user does not have to hang around while the actual printing takes place but, having sent a print job to the spooler, can get on with other work on the computer.

6.5.3 Buffering All I/O is relatively slow. For most of us, input by typing is painfully so. Even very fast typists are, from the processor’s point of view, so slow that they’re hardly doing anything at all. Screens and printers work a fair bit faster (as a rule) but all the same, the I/O system needs to work the devices efficiently so as not to waste processor time. A buffer is an area of memory set aside to help in the transfer of data between the computer and a device. A buffer provides a sort of barrier between parts of the computer system that work at different speeds. Buffering is used in sending blocks of data. Data is transferred into the buffer until it is full. Then the entire block is sent at once. This is more efficient than having the data trickle through the system. With single buffering, the receiver has to wait for a block before it can do anything with it, and the transmitter of the data has to wait for the receiver to have processed the block before it can send another. With double buffering, two buffers are used and, as one is emptied, the other is being


filled up. This makes more efficient use of both transmitter and receiver of the data.
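The swap between the two buffers can be sketched sequentially. In a real system the filling of one buffer overlaps in time with the emptying of the other; here the swap is only simulated, and the data and block size are invented:

```python
# A sequential sketch of double buffering: while one buffer is handed to
# the device to be emptied, the other takes over as the one being filled.
def double_buffered_transfer(data, buffer_size):
    buffers = [[], []]
    filling = 0                                  # index of the buffer being filled
    sent = []
    for byte in data:
        buffers[filling].append(byte)
        if len(buffers[filling]) == buffer_size:
            sent.append(list(buffers[filling]))  # hand the full buffer to the device
            buffers[filling].clear()
            filling = 1 - filling                # swap: fill the other buffer meanwhile
    if buffers[filling]:
        sent.append(list(buffers[filling]))      # flush the partial final block
    return sent

print(double_buffered_transfer("HELLODEVICE", 4))
# three blocks: HELL, ODEV, ICE
```

The receiver always gets whole blocks, and because the producer switches to the second buffer while the first is being drained, neither side has to stand idle waiting for the other.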

6.5.4 Spooling When large amounts of data are to be sent to a peripheral device, or when the peripheral is shared across a network, spooling is the preferred method of compensating for the difference in speed between the processor and the peripheral. Spooling involves the input or output of data to a tape or a disk. This, for example, allows output to be queued from many different programs and sent to a printer by a print spooler (special operating system software). The print spooler stores the data in files and sends it to the printer when it is ready, using a print queue.
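The queue behaviour of a print spooler can be sketched like this. The job names are invented, and a real spooler stores the queued jobs in files on disc rather than in memory:

```python
from collections import deque

# A sketch of a print spooler: jobs queue up and are sent to the printer
# one by one, so applications never wait for the printer itself.
class Spooler:
    def __init__(self):
        self.queue = deque()

    def submit(self, job):
        self.queue.append(job)          # returns at once; the user carries on working

    def run_printer(self):
        printed = []
        while self.queue:               # the printer drains the queue when ready
            printed.append(self.queue.popleft())
        return printed

spool = Spooler()
spool.submit("report.doc")
spool.submit("letter.doc")
print(spool.run_printer())  # ['report.doc', 'letter.doc']
```

Jobs come off the queue in the order they were submitted, which is the orderly, one-by-one servicing described above.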

Spooling to a printer On the Web is a simulation that shows how spooling is handled by the operating system. You should now look at this animation. (5 min)

6.5.5 Questions on input and output Q19: A device driver is a piece of hardware.

a) True b) False

Q20: A print spooler is a piece of hardware.

a) True b) False

Q21: When we talk of DMA, do we mean access by

a) The user b) A device controller c) An application

6.6 File Management The file management layer is perhaps the most familiar aspect of the Operating System to most users. Whenever you carry out any of the following operations:

• loading an application

• saving a document

• creating a new folder

• renaming a file

• deleting an unwanted application


• copying a document from the hard disc to a CD

you use the user interface of the operating system to manipulate files. Your mouse click or keyboard command is interpreted by the user interface and passed to the file management system to be executed. To the user, a file is a meaningful data object, with a name, stored on some device. The user is not usually concerned with the physical detail of exactly which track and sector of the disc is being used to store the file. To the computer, a file is simply binary data stored somewhere on a device. The computer is not concerned about what the file means to the user. As a user, you may arrange similar files to be stored in a single folder, so that you can easily find them when required. The actual data may be scattered across the storage device. A major function of the file management section of the operating system is to map the logical view (as seen by the user) on to the physical view (as seen by the computer). We will consider these views separately, then look at how the operating system can map them to each other.

6.6.1 Logical View of files

From the user's viewpoint, a file is a meaningful collection of data. The file will have a name, a file type, a location (within a folder on a disc), and a size. It may have other attributes such as its creation date, the date it was last modified, whether or not it is read-only, and so on.

Folders and directories

Files are grouped in directories or folders. Directory is a computing word, which will be explained later. Folder is a word first used on Apple Macs and again derives from office work, where files might be kept in folders. So far as a user is concerned, folders and directories are the same thing. Folders can contain folders. Applications install themselves into folders. Users arrange their work into folders.

These notes are being written on a laptop, using a package called Xmetal, which has been installed in a folder called xml on the root; the simple diagrams are drawn with a package called Freehand, which is in a folder called Softquad in the Programs folder. The writer refers to himself as B (and doesn't use the folder My Documents because he's not going to be the machine's only user). Figure 6.11 shows part of the folder structure on the laptop's hard disc.


C:\ (root)
├── B
│   └── xml
│       └── os
├── xml
└── Programs
    └── Softquad

Figure 6.11: Example folder structure

This file is in the os folder, along with all the diagrams. There are, of course, many other folders in the root, and also many others in the Programs folder and in the writer's folder.

Folders make things tidy and simple. To have these notes printed out or processed to appear on the web, the writer simply has to transmit or make available the os folder, with all the text, diagrams and other things in it. Folders help you keep your work compartmentalised. Furthermore, if you're looking for a file you've made, you can limit the search (whether done automatically or by eye) to your work folders.

Properly organised, folders make backing up and restoration straightforward. You keep your work separate from the application you're using, and you can make a copy of your work. If the application becomes corrupted, you can restore it from the original discs without your work being affected. If one of the copies of your work becomes damaged, you can restore it from the other.

The arrangement of folders on the root containing folders that contain folders, and so on, is often described as hierarchical.

Paths

In the previous figure, there are two folders called xml, one for the application package and one for the files that the writer creates using that package. There is no confusion because the folders are on different paths: their full names, so to speak, are different. One is C:\xml and the other is C:\B\xml. There could not, of course, be two folders with the same name in the same place; there could not, say, be two folders called xml on the root.

Note, by the way, that an application is stored in one place but the files created by using it are kept in another. This keeps things tidy: any files in the application folder are part of the application, so reinstallation or upgrading will not inadvertently cause problems for files created by the user. In the same sort of way, files are known to the system by their full names, including paths.
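Python's pathlib can illustrate the point (this assumes the folder layout of Figure 6.11; the file name notes.xml is invented for the example):

```python
from pathlib import PureWindowsPath

app_xml  = PureWindowsPath(r"C:\xml")     # the application's xml folder
work_xml = PureWindowsPath(r"C:\B\xml")   # the writer's xml folder

# Same folder name, but the full names (paths) differ, so no confusion:
print(app_xml == work_xml)                # False

# A file is known to the system by its full name, including the path:
print(work_xml / "os" / "notes.xml")      # C:\B\xml\os\notes.xml
```

The final line shows how the full name of a file is built from the chain of folders leading down from the root.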


Q22:

C:\ (root)
├── B
│   └── xml
│       └── os
├── xml
└── Programs
    └── Softquad

If a file called os1.xml were in the os folder above, what would be its full name (including its full path)?

6.6.2 Physical View

Most storage devices are divided up into tracks and sectors, forming blocks of a fixed size, perhaps 512 bytes.

[Diagram: a disc surface, showing concentric tracks divided into sectors; each segment of a track within a sector forms a block, with inter-sector gaps between the blocks.]

The term cluster is also used to describe the smallest section of a storage device that the operating system will recognise. This may be as small as a single 512 byte block, but on many modern large hard disc systems, the basic unit (block or cluster) may be as large as 32K. Each of these blocks or clusters is numbered, so that the operating system can keep track of where data has been stored. We can represent a storage device diagrammatically as a collection of numbered clusters:

 1  2  3  4  5  6  7  8  9 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30

Files can be allocated to these blocks (or clusters) in several different ways. These include contiguous allocation (using best, first or worst fit), linked allocation, indexed allocation and file allocation tables (FAT).
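A file always occupies a whole number of clusters: its size divided by the cluster size, rounded up. A one-line sketch in Python, using the file sizes from the worked example that follows:

```python
def clusters_needed(file_size_kb, cluster_size_kb=32):
    """Smallest number of whole clusters that can hold the file."""
    return -(-file_size_kb // cluster_size_kb)   # ceiling division

print(clusters_needed(64))    # 2 clusters
print(clusters_needed(97))    # 4 clusters (just too big for 3)
print(clusters_needed(3))     # 1 cluster (a whole 32K cluster is the minimum)
print(clusters_needed(220))   # 7 clusters
```

Note how even a 3K file consumes a full 32K cluster; this wasted space inside the last cluster of every file is one cost of large cluster sizes.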

6.6.2.1 Contiguous Allocation

The simplest method is contiguous allocation. This means that each file is stored in a series of neighbouring clusters. The operating system keeps a directory of where each file is stored, with a pointer to the start of each file. This system was used on some early computers, such as the BBC Micro. Unfortunately, it has a number of weaknesses, as illustrated below. Imagine a situation where the following sequence of user commands is carried out, using a disc with 30 clusters, each of size 32K:

Command 1: save file 1 (size 64K). This will fit in 2 clusters. The first cluster is reserved for the disc directory, so file 1 is allocated to clusters 2 and 3.

[1: directory | 2-3: file 1 | 4-30: free]

Command 2: save file 2 (size 97K) This will need 4 clusters (just too big to fit into 3 clusters). File 2 allocated to clusters 4,5,6 and 7.

[1: directory | 2-3: file 1 | 4-7: file 2 | 8-30: free]


Command 3: save file 3 (size 3K) This will need 1 cluster (although it is only 3K, a 32K cluster is the minimum allocation). File 3 allocated to cluster 8.

[1: directory | 2-3: file 1 | 4-7: file 2 | 8: file 3 | 9-30: free]

Command 4: save file 4 (size 220K) This will need 7 clusters. File 4 allocated to clusters 9 to 15.

[1: directory | 2-3: file 1 | 4-7: file 2 | 8: file 3 | 9-15: file 4 | 16-30: free]

Command 5: delete file 2 Clusters 4 to 7 now free.

[1: directory | 2-3: file 1 | 4-7: free | 8: file 3 | 9-15: file 4 | 16-30: free]

Command 6: save file 5 (220K) This will need 7 clusters. These need to be contiguous, so file 5 allocated to clusters 16 to 22.

[1: directory | 2-3: file 1 | 4-7: free | 8: file 3 | 9-15: file 4 | 16-22: file 5 | 23-30: free]


Command 7: copy file 1 (64K). This will need 2 clusters. File 1 (copy) is allocated to the now empty clusters 4 and 5.

[1: directory | 2-3: file 1 | 4-5: file 1 (copy) | 6-7: free | 8: file 3 | 9-15: file 4 | 16-22: file 5 | 23-30: free]

Command 8: save file 6 (70K) This will need 3 clusters. These must be contiguous, so clusters 6 and 7 cannot be used. File 6 allocated to clusters 23 to 25.

[1: directory | 2-3: file 1 | 4-5: file 1 (copy) | 6-7: free | 8: file 3 | 9-15: file 4 | 16-22: file 5 | 23-25: file 6 | 26-30: free]

Command 9: save file 7 (190K). This will need 6 clusters, which must be contiguous, so neither clusters 6 and 7 nor clusters 26 to 30 can be used. Although there are more than 6 free clusters in total, the file cannot be saved because they are not contiguous.

[unchanged: 1: directory | 2-3: file 1 | 4-5: file 1 (copy) | 6-7: free | 8: file 3 | 9-15: file 4 | 16-22: file 5 | 23-25: file 6 | 26-30: free]

Command 10: add data to file 1. Perhaps file 1 is a word processed document, and the user loads it up, adds some extra paragraphs, then tries to save it. The system will be unable to save the file as the next cluster (cluster 4) is already in use.

[unchanged: 1: directory | 2-3: file 1 | 4-5: file 1 (copy) | 6-7: free | 8: file 3 | 9-15: file 4 | 16-22: file 5 | 23-25: file 6 | 26-30: free]

The example above illustrates some of the weaknesses of contiguous file allocation (a real disc may have thousands of files stored in thousands of clusters, so the problems become even more severe):

• The disc becomes fragmented (the free space is split up and scattered across the disc surface, rather than left in usable contiguous blocks)

• Files cannot be extended

The system must also maintain a directory of where the files are stored. This can be a simple table with the file ID (a number) and the number of the cluster where the file begins. This could be stored in the first cluster on the disc. For the example above, the directory after command 9 would look like:

File ID   Starting cluster
1         2
1b        4
3         8
4         9
5         16
6         23

6.6.2.2 Allocating files to free space

In the previous example, at command 7, the system had to decide where to put the copy of file 1. Three methods are possible for choosing where to put the file. You have already met these three methods in the topic on memory management. They are

• First fit (use the first available large enough group of clusters)

• Best fit (use the group of available clusters which most closely fits the file)

• Worst fit (use the largest available group of clusters)
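The three strategies can be sketched in Python (a simplified model; free space is represented here as a list of (start, length) extents, a representation invented for illustration, and the example extents are likewise invented):

```python
def first_fit(extents, size):
    """Use the first extent large enough for the file."""
    for start, length in extents:
        if length >= size:
            return start
    return None

def best_fit(extents, size):
    """Use the smallest extent that still fits the file."""
    candidates = [(length, start) for start, length in extents if length >= size]
    return min(candidates)[1] if candidates else None

def worst_fit(extents, size):
    """Use the largest available extent, if it fits."""
    if not extents:
        return None
    length, start = max((length, start) for start, length in extents)
    return start if length >= size else None

# Invented free space: a 3-cluster gap at 10, 6 at 20, 2 at 30.
free = [(10, 3), (20, 6), (30, 2)]
print(first_fit(free, 2))   # 10 - the first gap big enough
print(best_fit(free, 2))    # 30 - the 2-cluster gap fits exactly
print(worst_fit(free, 2))   # 20 - the 6-cluster gap is the largest
```

A request that no extent can satisfy (here, anything larger than 6 clusters) returns None under all three strategies: the file cannot be saved contiguously.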


Q23:

[1: directory | 2-3: file 1 | 4-7: free | 8: file 3 | 9-15: file 4 | 16-22: file 5 | 23-30: free]

Where would the copy of file 1 be placed if the system used

a) first fit
b) best fit
c) worst fit

Systems which used contiguous allocation required a utility program to compact all the free space. The utility moved all the files together, grouping all the free space into a single extent (contiguous group of clusters). Compaction would turn:

[1: directory | 2-3: file 1 | 4-5: file 1 (copy) | 6-7: free | 8: file 3 | 9-15: file 4 | 16-22: file 5 | 23-25: file 6 | 26-30: free]

into:

[1: directory | 2-3: file 1 | 4-5: file 1 (copy) | 6: file 3 | 7-13: file 4 | 14-20: file 5 | 21-23: file 6 | 24-30: free]

This would allow command 9 (save file 7) to be carried out, but command 10 (extend file 1) would still be impossible.

To avoid the problems of contiguous allocation, most systems use some form of non-contiguous allocation. This means that files can be split up into clusters which are not adjacent to each other.


6.6.2.3 Linked Allocation

Using linked allocation, files do not need to be stored in contiguous clusters. This would allow command 9 to succeed.

Command 9: save file 7 (190K). This will need 6 clusters, which do not need to be contiguous, so file 7 can be stored in clusters 6 and 7, and clusters 26 to 29.

[1: directory | 2-3: file 1 | 4-5: file 1 (copy) | 6-7: file 7 | 8: file 3 | 9-15: file 4 | 16-22: file 5 | 23-25: file 6 | 26-29: file 7 | 30: free]

Command 10 can also be executed, as file 1 can be extended into cluster 30:

[1: directory | 2-3: file 1 | 4-5: file 1 (copy) | 6-7: file 7 | 8: file 3 | 9-15: file 4 | 16-22: file 5 | 23-25: file 6 | 26-29: file 7 | 30: file 1]

As before, the system must maintain a directory of where each file starts. One further requirement is that each cluster must contain a pointer to the next cluster which the file uses. So, for example, cluster 3 must have a pointer to cluster 30, and cluster 7 must have a pointer to cluster 26. This reduces the amount of space available for actual data in each cluster, but that is a small price to pay for the more efficient use of the disc space.
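Reading a linked file can be sketched in Python (a simplified model; each cluster is represented as a (data, next-cluster) pair in a dictionary, with None marking the end of the file):

```python
# File 1 after command 10: clusters 2 and 3, extended into cluster 30.
# Each cluster stores its data plus a pointer to the file's next cluster.
clusters = {
    2:  ("file 1, part 1", 3),
    3:  ("file 1, part 2", 30),    # cluster 3 points to cluster 30
    30: ("file 1, part 3", None),  # None marks the end of the file
}

def read_file(start):
    """Follow the chain of pointers from the file's first cluster."""
    parts, current = [], start
    while current is not None:
        data, next_cluster = clusters[current]
        parts.append(data)
        current = next_cluster
    return parts

print(read_file(2))   # the three parts of file 1, in order
```

Notice that the whole chain must be walked cluster by cluster; there is no way to jump straight to, say, the third part of the file.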


In a more realistic example, with a large hard disc, files could be spread over a large number of non-contiguous clusters, like this:

Figure 6.12: Two files spread across the non-contiguous blocks of a large disc. [Diagram: my file.doc occupies blocks 26, 35 and 20; my other file.jpg occupies blocks 12, 42 and 80. Each block carries a pointer to the next block of the file, and the last block of each file is marked "end".]

In Figure 6.12:

• my file.doc starts in block 26, which has a pointer to block 35, which has a pointer to block 20, where the file ends.

• my other file.jpg starts in block 12, which has a pointer to block 42, which has a pointer to block 80, where the file ends.

Other disadvantages of linked file allocation include:

• Loading a file can be slow, as the file may be widely spread around the disc, so the disc head has to track back and forth to retrieve each section of the file

• If a cluster in the middle of a linked file becomes corrupted, it may be impossible to retrieve the rest of the file, even though the data is still present on the disc, as the chain of pointers may be broken.

6.6.2.4 Indexed Allocation

Indexed allocation works in a very similar way to linked allocation, except that, instead of each cluster holding a pointer to the next, all the pointers for a file are stored together in an index block. There is an index block for each file stored on the disc.

One disadvantage of this is that when reading a file, the system must first read the index block. This slows down file reading, especially for small files; for large files, reading one extra block is not such a problem. To reduce this overhead, some systems copy the index blocks into memory. A very fragmented file will have a large index block, containing many pointers, so systems attempt to keep files as contiguous as possible. Despite these disadvantages, most modern operating systems use some form of indexed non-contiguous file allocation.
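The idea can be sketched in Python (a simplified model; the file names are invented, and the block numbers are borrowed from the linked-allocation example in Figure 6.12 for comparison):

```python
# One index block per file, listing all of that file's clusters in order.
index_blocks = {
    "report.doc": [26, 35, 20],   # a fragmented file spread over 3 clusters
    "photo.jpg":  [12, 42, 80],
}

def read_file(name):
    """One read of the index block yields every cluster number at once;
    there is no chain of pointers to follow through the data clusters."""
    return index_blocks[name]

print(read_file("report.doc"))   # [26, 35, 20]
```

Contrast this with linked allocation: here the system can seek directly to any part of the file once the index block is in memory, at the cost of that one extra read.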

6.6.2.5 File Allocation Tables

One version of indexed allocation, used by MS Windows, is FAT - file allocation tables. A File Allocation Table is an index block for the whole disc. It has a directory which points to the first cluster of each file. The FAT has an entry for every cluster on the disc. This entry will be either

• a pointer to the next cluster for that file, or

• a zero to indicate that this cluster is unused, or

• an end of file marker

So, a FAT for a disc may look like this:

Filename       First cluster
letter.doc     4
expenses.xls   8
fun.exe        5
note.txt       11

Cluster   Pointer
0         -
1         2
2         EOF
3         0
4         6
5         1
6         12
7         0
8         EOF
9         10
10        EOF
11        9
12        13
13        EOF

So letter.doc begins in cluster 4. Cluster 4 points to cluster 6, which points to 12, which points to 13, which is the end of the file. letter.doc is therefore stored in clusters 4, 6, 12 and 13.
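Following a FAT chain can be sketched in Python, using the directory and FAT from the tables above:

```python
# The FAT from the table above: cluster -> entry
# (0 marks an unused cluster, "EOF" the end of a file).
FAT = {0: "-", 1: 2, 2: "EOF", 3: 0, 4: 6, 5: 1, 6: 12,
       7: 0, 8: "EOF", 9: 10, 10: "EOF", 11: 9, 12: 13, 13: "EOF"}

# The directory: each file's first cluster.
directory = {"letter.doc": 4, "expenses.xls": 8, "fun.exe": 5, "note.txt": 11}

def clusters_of(filename):
    """Follow the FAT entries from the file's first cluster to EOF."""
    chain, cluster = [], directory[filename]
    while True:
        chain.append(cluster)
        if FAT[cluster] == "EOF":
            return chain
        cluster = FAT[cluster]

print(clusters_of("letter.doc"))   # [4, 6, 12, 13]
```

Running clusters_of on the other directory entries reproduces the working needed for Q24 below.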

Q24: Use the FAT above to work out where each of the 4 files above are stored on the disc.

Assuming each entry in the FAT is a 16-bit number, this limits a disc to 2^16 (65,536) clusters. If each cluster were 32K (as in our example), the maximum disc capacity would be 65,536 x 32K = 2,097,152K = 2Gb. The only way to use FAT16 for larger discs is to make the minimum cluster size larger: to index an 80Gb disc using FAT16 would require clusters of 1280K (1.25Mb). Many files would be smaller than this, so huge amounts of space could be wasted on the disc.

To overcome this, Microsoft introduced FAT32, which uses 32-bit pointers. This allows up to 4 billion clusters, so clusters can be made much smaller, allowing more efficient use of disc space. However, the FAT would be very large if it had an entry for each of 4 billion clusters (4 billion x 32 bits). In practice a compromise figure is used, depending on the disc capacity.
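The arithmetic can be checked directly in Python:

```python
# FAT16: 16-bit entries limit a disc to 2**16 clusters.
clusters = 2 ** 16                        # 65,536 clusters
capacity_kb = clusters * 32               # with 32K clusters
print(capacity_kb)                        # 2097152 K, i.e. 2 Gb

# To index an 80 Gb disc with only 65,536 clusters, each cluster
# must hold 80 Gb / 65,536 clusters:
cluster_kb = (80 * 1024 * 1024) // clusters
print(cluster_kb)                         # 1280 K per cluster
```

A 3K file saved on such a disc would still consume a whole 1280K cluster, which is where the wasted space comes from.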

6.6.2.6 The file management system in operation

How does the file management system relate to the other layers of the operating system? This diagram and example should help you to understand how this works:

user
  |
user interface (application program)
  |
file manager --- device directory
  |
I/O system (device drivers)
  |
RAM --- storage device

Suppose a user requires a file to be copied from a CD to the hard disc of the computer. The user simply drags the file icon from the CD window to the hard disc window on the desktop. What is going on within the system?

• The user interface detects the mouse movement and the pressing of the mouse button to drag the icon across the screen.

• The user interface sends a message to the file management system to find the file on the CD.

• The file management system checks the directory for the CD drive, locates the file and checks its size. (Note: the directory will need to be read from the CD unless the disc has already been mounted, in which case the directory may already be available in RAM.)

• File management sends a message to memory management to reserve a block of RAM to hold the file.

• File management instructs the appropriate device driver to activate the CD drive and read the file into RAM.

• The device driver notifies the file manager that the file has been read successfully.

• The file manager then checks its directory for the hard disc, and locates a suitable group of clusters in which to store the file.

• The file manager instructs the device driver for the hard disc to write the file from RAM to the disc.

• Once completed successfully, the I/O system notifies the file manager that the file has been written.

• File management notifies memory management that the block of RAM used to hold the file can now be released for other use.

• File management notifies the user interface that the operation has been completed successfully.

• The user interface displays a message on screen that the file has been copied to disc.

Note that instructions to the file management system may come directly from the user via the user interface, or may be issued by an application on behalf of the user. For example, a user may be using a word processing package with an auto-save feature. Every 5 minutes, the application sends a request to file management to save the file currently held in RAM. When the save completes successfully, the file manager sends a message back to the application, which may indicate this to the user with a pop-up window or an audible sound.
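The message flow described in the bullets above can be sketched as a simulation in Python (every step is just a recorded message; all the wording is invented for illustration):

```python
log = []

def step(message):
    """Record one message passed between operating system layers."""
    log.append(message)

def copy_file(name):
    """Simulate the flow of messages when a file is copied from CD to disc."""
    step(f"user interface: request copy of {name} from CD")
    step("file manager: read CD directory, locate file, check size")
    step("memory manager: reserve a block of RAM for the file")
    step("device driver: read the file from CD into RAM")
    step("file manager: find free clusters on the hard disc")
    step("device driver: write the file from RAM to the hard disc")
    step("memory manager: release the RAM block")
    step("user interface: display 'file copied' message")
    return log

for line in copy_file("photo.jpg"):
    print(line)
```

Tracing the printed log against the bulleted list above is a useful way to check you can name which layer performs each step.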

Q25: Describe the sequence of operating system tasks that will result from a user selecting File - open - document.doc from within a word processing application. Assume that the file is stored on the hard disc of the computer system.


6.7 Summary

• the main functions of an operating system include

  - memory management
  - scheduling (process management)
  - managing input and output
  - file management

• memory management may use

  - variable partitioning of memory
  - best fit, worst fit or first fit algorithms

  to allocate memory to processes

• A multitasking system must allocate processor time to different tasks - this is called scheduling

• Methods of scheduling include

  - round robin scheduling (giving each process a short time slice in turn)
  - multi-level feedback queues

• DMA (direct memory access) improves system performance by allowing large blocks of data to be transferred between a storage device and memory without involving the processor

• The file management system maintains a mapping between the logical view of files and their physical location

• The physical allocation of files to clusters on a storage device can be

  - contiguous, or
  - non-contiguous, using linked allocation, indexed allocation or file allocation tables


6.8 Revision Questions

Q26: Which of these is not a function of a standard operating system?

a) memory management
b) virus checking
c) file management
d) scheduling

Q27: An operating system allocates files to the largest available area of free RAM. This algorithm is known as

a) best fit
b) first fit
c) worst fit
d) last fit

Q28: Round robin scheduling

a) is only required in multi-user systems
b) gives each process equal time slices in order
c) gives larger time slices to high priority processes than to low priority processes
d) uses multiple queues to ensure all processes are treated fairly

Q29: A cluster

a) is the smallest area of a storage device that can be allocated to a file
b) is a group of files, all in the same folder or directory
c) is a table showing where on a disc a file is stored
d) is a complete track on a hard disc

Q30: In a contiguous file allocation system,

a) files can be split up into several sections and stored in a number of locations across a disc
b) pointers are used to indicate where the next cluster of a file is stored
c) a file can only be stored in a series of neighbouring clusters
d) file allocation tables are used to index the disc



Topic 7

Operating System Developments

Contents

7.1 Prior Knowledge and Revision ...... 189
7.2 The User interface ...... 189
  7.2.1 Command line interfaces ...... 190
  7.2.2 Graphical user interfaces ...... 191
  7.2.3 Questions on the user interface ...... 193
  7.2.4 Trends in User Interfaces ...... 193
  7.2.5 Comparing CLI and GUI system demands ...... 195
7.3 Operating System Application Services ...... 196
  7.3.1 Applications support ...... 196
  7.3.2 Communication between processes ...... 197
  7.3.3 Services provided to programs by the operating system ...... 198
7.4 User Access ...... 198
  7.4.1 Back up ...... 199
  7.4.2 Passwords ...... 199
  7.4.3 Control of access to files ...... 200
7.5 Comparing 2 Operating Systems ...... 200
  7.5.1 About Linux ...... 200
  7.5.2 Memory management ...... 201
  7.5.3 Processing modes ...... 203
  7.5.4 Input and output ...... 204
  7.5.5 Filing system ...... 205
  7.5.6 Resource sharing ...... 206
  7.5.7 User interface ...... 207
  7.5.8 Applications support ...... 208
  7.5.9 Security ...... 208
  7.5.10 Comparing Windows 98 and Linux ...... 210
  7.5.11 Summarising two operating systems ...... 211
7.6 Summary ...... 212
7.7 Revision Questions ...... 213

Learning Objectives

After studying this topic, you should be able to:

• Describe the trend in GUI design towards increasing user convenience

• Explain how this trend leads to

  - Increasing software complexity
  - Increasing demands on system resources
  - Increasing burdens on system performance

• Compare the demands made on system resources by CLIs (command line interfaces) and GUIs (graphical user interfaces)

• Describe the trend towards expanding the role of the operating system to provide more services to applications

• Describe how access control is achieved in single user and multi-user systems

• Describe typical backup facilities provided by operating systems

• Describe and compare the features of two specific operating systems


7.1 Prior Knowledge and Revision

Before studying this topic, you should be able to:

• Distinguish clearly between the operating system and application programs

• Describe the operation of command line and graphical user interfaces

• Distinguish between single user and multi-user systems

Revision

Q1: Which of the following is not an operating system?

a) MS Windows 2000
b) Unix
c) MS Word
d) Linux

Q2: On a particular computer system, the method used to load a file is to type load:filename and then press the enter key. The system is using

a) a graphical user interface
b) a command line interface
c) Linux
d) a speech-driven interface

Q3: Icons are used in

a) command line interfaces
b) graphical user interfaces
c) all user interfaces
d) all operating system interfaces

Q4: A multi-user system

a) is another name for a multi-tasking system
b) is a computer that can be used by many users, but only one at a time
c) is a type of local area network
d) is a computer that can be used by many users simultaneously

7.2 The User interface

The user interface is the means by which user and computer can communicate. The user interface gives the user access to the facilities that the operating system has to offer. There are two kinds of user interface: command-line interfaces (CLIs) and graphical user interfaces (GUIs).


7.2.1 Command line interfaces

Of the two, command line interfaces came first. A CLI is based on the idea of a screen that can take lines of text. As a user, you enter a command as a line of text, by means of the keyboard, and you read what you've typed on the screen. It doesn't take much processing power to run an interface of this kind at an acceptable speed, and a CLI does not take up a lot of room in memory.

But there are problems. You have to learn what all the commands are, and you have to learn how to use them (their syntax). The names of commands are by no means always intuitively obvious. To use commands properly you have to learn a lot about switches and parameters. You have to type accurately: one missing letter, or a full stop instead of a colon, and the whole thing could go horribly wrong.

CLIs are designed in a no-nonsense, no-friendliness sort of way. If a command is successfully carried out, the CLI, by default, sends no signal to say so. It only sends a signal (an error message) if something has gone wrong. These error messages are usually devised by people who might know everything about operating systems but find it hard to communicate with people who don't; CLI error messages are notorious for not making sense to the uninitiated. You are able to get on-line help, but you need to know quite a bit before you can even get it, never mind make sense of it. People find it discouraging to have to spend so much time getting started. On the other hand, people who have gone to the trouble of learning the commands in a CLI feel pleasantly in control of things: when they enter a command, there aren't going to be any unintended side-effects.

MS-DOS is one of the best known command line interface operating systems. It was the standard operating system on most PCs during the 1980s, before the introduction of MS Windows 1.0 in 1985.


Figure 7.1: Typical MS-DOS command line interface

Advantages and Disadvantages of Command Line Interfaces (10 min)

On the Web is an activity that asks you to identify the advantages and disadvantages of a command based interface. You should now do this activity.

7.2.2 Graphical user interfaces

Apple Macs had GUI interfaces from the start. Many Mac users feel that Windows has yet to catch up with what Macs had to offer in the mid 80s, in terms of a GUI that is user-friendly.


Figure 7.2:

With a GUI, the user is presented with an interface which is meant to be easier to use than a CLI. The user is presented with a graphical screen and points with a mouse to make choices. The screen contains pictures (called icons) and menus. Altogether, the user is presented with a WIMP environment (Windows, Icons, Menus, Pointer). You have to recognise things rather than remember commands and switches.

In a GUI you are offered one or more windows. You can alter the size and position of a window. Windows that contain too much to fit on the screen at once will have scroll bars. Each window will contain icons (pictures to represent things) and menus. You can use a pointer, controlled by a mouse, to highlight icons or commands in menus; you select them by clicking a mouse button. Such an interface is said to be user-friendly. Help is available at the click of a mouse button. You're supposed to find out everything you need to know by looking at the screen rather than by thumbing through books.

There are, of course, drawbacks. GUIs make great demands on the hardware resources of a computer system: they take up lots of memory, and the processor has to be very fast if the screen, and the system as a whole, is to respond quickly enough to be satisfactory. Many users don't feel in direct enough contact with the underlying operating system, and prefer to use CLIs. This is particularly the case when the user is trying to do something difficult. Such operations, if achievable at all in a GUI, seem to call for adjustments in half a dozen different windows, a radio button here, a check-box there; it's often a lot more straightforward to type in a few commands.


The Windows interface

Windows provides a full GUI with a full WIMP environment. The layout on the screen is often known as the desktop. Users can modify this to suit themselves: they can configure the desktop; they can customise the interface. Each user of a Windows machine can have a profile that contains details of his or her preferred desktop configuration.

Figure 7.3:

7.2.3 Questions on the user interface

Q5: Windows was the first GUI.

a) True
b) False

Q6: What does WIMP stand for?

Q7: What advantages might there be in using a CLI rather than a GUI?

7.2.4 Trends in User Interfaces

User interfaces are constantly being developed and evolved. Competition between major software companies leads to a feature race, with each new version of an operating system including new features designed to make the system more convenient for the user.


User Interface trends

Look up the following web resources, take a look at some of the screenshots, and write a brief summary of the development of the user interfaces of the MS Windows and Apple MacOS operating systems:

http://members.fortunecity.com/pcmuseum/windows.htm
http://www.microsoft.com/windows/WinHistoryDesktop.mspx
http://applemuseum.bott.org/sections/os.html

As you have seen, user interfaces have increased in complexity, with more features, higher resolution graphics, increased use of colour, and increasing options for the user to customise the look and feel of the interface. This has resulted in the operating system making increasing demands on the system. Early operating systems required only a few kilobytes of RAM, whereas a modern operating system may require hundreds of megabytes. Not only do modern operating systems require huge amounts of RAM, they also require fast processors, mainly because of the need to process high resolution graphics in real time.

The development of increasingly user-friendly software and increasingly powerful computers is intertwined. Over the last 30 years, there has been an interdependent spiral of

• Improved hardware (faster, smaller, more complex, larger capacity, new materials)

• Improved software (more features, more user-friendly)

As hardware developers produce more powerful machines, software developers extend their products to use the new hardware, then make further demands, which lead the hardware developers to improve the hardware, which ...... and it shows no sign of stopping!


[Diagram: a cycle in which improved hardware allows more complex software, which in turn requires further improved hardware.]

7.2.5 Comparing CLI and GUI system demands

A GUI (graphical user interface) has many advantages over a CLI (command line interface): it is more intuitive to use, the user does not need to memorise (and enter correctly) commands, and the use of windows makes it much easier to implement multi-tasking. However, a GUI is much more demanding on system resources than a CLI. Consider a user wishing to load a file.

In a CLI-based operating system, the user simply types in the command - perhaps  LOAD "myfile.doc" and presses the Enter key. These key strokes are read into the keyboard buffer in the keyboard interface, then processed. The CLI recognises the keyword LOAD as a call to a subroutine within the operating system, which activates the file and memory management sub systems, to load the file. On completion of the load operation, the monitor will display a new command prompt to indicate that the file has been loaded correctly. In a GUI-based operating system, the user may double-click on a file icon using a mouse. The double click on the mouse button will generate an interrupt to the processor, indicating that the current process must be stopped to deal with the double click event. In order to "understand" what this double click means, the system must keep track of the position of the mouse pointer at all times, by receiving and processing the steady stream of electronic signals coming from the mouse’s movement sensor system (optical or mechanical). It must then compare the pointer position with the current screen display to identify what the use has double-clicked on. Was it a file? Was it an icon representing a command? Was it a "blank" area of the screen? Once this has been identified, the system then activates the appropriate operation within the file and memory management sub systems. Generally speaking, the file is then likely to be displayed in a window on the screen, or a dialogue box may be displayed indicating the status of the operation. This will require the operating system to create a new window or dialogue box, which may overlap with existing windows. The system must update the screen mapping in

© HERIOT-WATT UNIVERSITY 2006 196 TOPIC 7. OPERATING SYSTEM DEVELOPMENTS

memory to effect this change in the display. It is clear from the above description that the GUI uses much more processor time, and requires much more use of memory, than the CLI. Modern systems, however, have faster processors and more RAM than earlier systems, so are capable of handling the additional overheads required by GUIs. Only in very specialised circumstances are CLIs now preferred.
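The CLI side of this comparison boils down to reading a line of text and matching its first word against a table of commands. A minimal sketch in Python (the LOAD command and its behaviour here are hypothetical, echoing the example above rather than any real operating system's CLI):

```python
# Toy command interpreter: split the line, match the keyword,
# then hand off to the appropriate "subsystem" (stubbed out here).
def handle_command(line):
    parts = line.strip().split(maxsplit=1)
    if not parts:
        return "no command entered"
    keyword = parts[0].upper()
    if keyword == "LOAD":
        if len(parts) < 2:
            return "error: LOAD needs a filename"
        filename = parts[1].strip('"')
        # a real CLI would now call the file and memory management subsystems
        return f"loading {filename}"
    return f"error: unknown command {keyword}"

print(handle_command('LOAD "myfile.doc"'))   # loading myfile.doc
```

Notice there is no pointer tracking, no screen comparison and no window redrawing: the whole interaction is a few string operations, which is why a CLI is so much lighter on resources.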

7.3 Operating System Application Services Most of the situations we have described so far assume that the user is interacting directly with the operating system. However, in many situations, the user deals with an application program, and the application program calls on the operating system to carry out work on its behalf:

[Figure: the user interacts with the user interface of an application program; the application calls on the file manager (and its directory) and on the I/O system (device drivers), which move data between RAM and the storage device.]

For example, a user running a Word Processing package may select the Save option. The WP package will pass this request to the operating system.

7.3.1 Applications support Operating systems offer support to applications that run on them. Applications can make system calls. System calls are supplemented by subroutine libraries. These subroutines, grouped together, provide an Application Programming Interface or API.

Windows and other software Windows, in one form or another, is the most common operating system in the world. Many people regard this as a triumph of marketing rather than of software engineering. This means that a lot of software is written to run under Windows. All of this software is designed to be easy to use. Because it all uses the Win32 API, it all even looks much the same, which makes it relatively straightforward for a user to do at least the basic things in a new application. The Windows API The Windows API is called Win32. It contains many categories, which contain hundreds, even thousands, of functions that a programmer can use. Win32 offers the programmer the chance to use such system services as printer sharing, TCP/IP, and network communications. It enables different applications to use features of the operating system in the same way, so that the operations needed by the user and the windows presented on the screen in the course of, say, saving a file can be the same in applications written by different programmers. This naturally makes things easier for the user. (In the old days, under DOS, each application did things in its own way.)

7.3.2 Communication between processes Windows also allows communication between processes. The two techniques it has used are dynamic data exchange (DDE) and object linking and embedding (OLE). DDE was used in early versions of Windows. In DDE, a dialogue is set up between two applications. The application that wants information is known as the client and the application that is to supply it is known as the server. The client can start a dialogue, send commands, ask for data and send data. The server carries out commands, can give and receive data, and can also close the dialogue. The client often makes use of a macro to implement the exchange. OLE is used in Windows 98. It enables applications to assemble documents whose parts have been written with different applications (so that, for example, pictures can be placed in a text document). Once the document has been assembled, any part of it can still be edited using the application it was created in. The files that support OLE are DLLs. Each item in an assembled document is an object, with its own data and properties. As an obvious example, this paragraph could be treated as an object. The data would be the text; the properties would include the font and the paragraph format. (The assembled document is itself an object, though it is often described as a compound document). A linked object is held separately and only loaded when it is needed: its location (a full path and filename) is stored as part of the data of the assembled document. Linking in this way can cause problems when documents are moved about from one computer system to another (because one has to fiddle with paths). An embedded object becomes part of the data of the assembled document and is stored as such. The drawback to this is that you can easily get very large files (especially if the objects have been generated by a graphics package) that can be difficult to move
around (on floppy disc or by e-mail). Microsoft has rationalised the development of OLE into a Common Object Model (COM). The idea is to allow (and encourage) other software houses to write applications that can share data under Windows.

7.3.2.1 Questions on applications support

Q8: What does OLE stand for? Q9: With respect to OLE, what’s the difference between linking and embedding?

7.3.3 Services provided to programs by the operating system It is useful to summarise the services which the OS can provide to applications. These are:

1. Providing a standard look and feel for applications - instead of each application using its own system of dialogue boxes for (say) printing or saving, every application can call on standard functions provided by the OS. This reduces program development time, reduces the scope for introducing bugs and makes it easier for users to learn new applications.
2. Simplifying and extending the graphic capabilities of applications - standard icons, window frames and other graphic elements are provided by the OS, so the application programmer again does not need to re-invent these.
3. Improving the capability of programs to communicate and pass data - dynamic data exchange (DDE), described in the previous sub-topic, is a good example of this.
4. Launching one application from within another, or allowing objects created in one application to be embedded in another (OLE), as described in the previous sub-topic.

These features are possible, because the operating system provides a library of objects and services that can be called on as required by applications.

7.4 User Access Computer systems are expensive. The work people do on them is, as a rule, important both to themselves and, where they have them, their employers. Security is to do with the protection of computer systems from damage. This Topic is to do with what operating systems can do to help with security. Data can be corrupted or even destroyed. The damage can be done by the outside world (fire, flood, etc.) or by a user making a bad mistake or by something in the computer system such as a hardware fault or a virus. Windows was originally a single-user system, so its security techniques are fairly limited.

7.4.1 Back up It’s a nuisance for anyone to lose data. For organisations that depend on computer systems, it can be catastrophic. The basic way of dealing with the problem is to have a copy of the data. Copying data for reasons of security is known as backing up. The idea of a back-up is that, should disaster strike the computer system, there should at least be a fairly recent version of the data available. Operating systems usually come with programs that can help with back-up. This program may be one of the accompanying utilities rather than actually part of the operating system. Back-up needs to be planned, especially in respect of when it’s to take place and where the back-up copies are to be stored. Human nature needs to be allowed for in such plans: if a back-up policy is too demanding, people might not follow it as enthusiastically as they should. Back-up usually takes place during the night when, as a rule, systems are at their least busy. All the same, complete back-up of a system can take a long time and use up vast amounts of (expensive) secondary storage. An alternative to making complete backups is incremental backup. A complete back up is made every so often (once a week or once a month, say). Every day a back up is made of all files that have been modified since the last full backup. This approach reduces the amount of storage space required but makes restoring data more complicated. Windows has its own backup program, inherited from DOS. For backing up their files over a network, however, users are recommended to use a backup agent, a network service supplied by a third party, such as ARCserve.
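The incremental scheme described above amounts to selecting files by modification time. A sketch of that selection (illustrative only - real backup tools such as the ones mentioned also consult archive flags and catalogues):

```python
# Pick out the files changed since the last backup ran.
import os

def files_to_back_up(paths, last_backup_time):
    """Return the files modified after last_backup_time (a Unix timestamp)."""
    return [p for p in paths if os.path.getmtime(p) > last_backup_time]
```

A full backup simply passes a last_backup_time of 0; an incremental run passes the timestamp of the previous backup, so only the changed files need be copied.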

7.4.2 Passwords Passwords present the most basic way of protecting multi-user systems. All users are given an identity name (an id) and must have a password. Passwords can be given by the system administrator or chosen by the user. One problem is that many people tend to use obvious passwords (such as their own forenames) or, to save typing, use very short ones. Some systems check all suggested passwords and reject those that are judged unsatisfactory. Passwords are encrypted by the system and stored in that form. This should mean that anyone getting hold of the file of passwords should be none the wiser. When the user enters a password, it is encrypted and compared with the encrypted entry in the password file. Windows provides password protection that prevents unauthorised access to networks and shared resources, but it does not stop anyone using a computer as a stand-alone machine. On a peer-to-peer network, you can restrict access to folders or printers either by not allowing them to be shared at all or by using passwords. A problem with using too many different passwords, of course, is that people can’t
remember them all. If a lot of different passwords are necessary, you need to keep a list of them; the list needs to be readily available to selected users, which often means it can be obtained by other users too.
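The encrypt-and-compare scheme described above can be sketched with a one-way hash. SHA-256 is used here purely for illustration; real login systems use salted, deliberately slow algorithms (bcrypt, scrypt and the like):

```python
# Store only a salted hash; check a login attempt by hashing it the same way.
import hashlib, os

def make_entry(password):
    salt = os.urandom(16)
    digest = hashlib.sha256(salt + password.encode()).hexdigest()
    return salt, digest                      # this pair goes in the password file

def check_password(attempt, salt, digest):
    # The plain-text password is never stored or compared directly.
    return hashlib.sha256(salt + attempt.encode()).hexdigest() == digest

salt, digest = make_entry("s3cret")
print(check_password("s3cret", salt, digest))     # True
print(check_password("forename", salt, digest))   # False
```

Anyone getting hold of the stored (salt, digest) pairs is, as the text says, none the wiser: the hash cannot be run backwards to recover the password.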

7.4.3 Control of access to files Operating systems need to offer forms of protection for files. Write access means that users can alter the file and, in practice, do anything they feel like with it. Read access means that the user can see the file, and copy it, but not alter it. Whatever the user tries to do to the file, it will remain unchanged. Files in Windows have attributes. They can be made read-only or hidden, which can protect them from accidental damage, though another user can always change the file attributes. Files can also be protected by password.

7.5 Comparing 2 Operating Systems In this Topic, we shall be reviewing two operating systems in respect of the features they provide. One is Windows 98, which has already been described, to some extent, in the last Topic. The other is Linux. It should be made clear from the first that the two systems are not just technically different; they are, so to speak, morally different. Windows is a commercial system, created by the programmers of a gigantic software house. It's there to make money. It has done this with great success. Linux is a free system, as a matter of principle. You can download it from the Internet for nothing. It is an example of open source software.

7.5.1 About Linux In the 1970s, Bell Labs and other organizations created the powerful UNIX operating system. The man chiefly responsible was Ken Thompson. In those days, computers tended to have their own operating systems, provided by the manufacturer (IBM, for example). This meant that, as a rule, they could not communicate with each other. Their operating systems were closed systems. The idea behind Unix was that it should be an open operating system. This would have two advantages: computers running it would be able to communicate, and computer operators of all kinds could move easily from one computer to another instead of having to spend weeks or months learning the operating system of each new machine they used. Unix became very popular in universities and other places of research because the people there approved of these advantages. Furthermore, Unix was written in C and came with a good scripting language, so that people could configure it to meet their
own (often fairly demanding) requirements. Unix became the foundation of most of the servers on the Internet. It can run on PCs and is the chosen operating system for supercomputers such as Crays. However, Unix comes in different flavours, though efforts are being made to standardise it, and it can be expensive. In the early 1990s, a Finnish student named Linus Torvalds wanted to run Unix on his PC, but couldn’t afford to. So he decided to write his own operating system from scratch. He wrote the ‘kernel’ (the inner core of the system) and obtained most of the application software, utilities, and programming tools from the Free Software Foundation’s GNU sources. With harmless egoism, he called the system Linux. It wasn’t Unix but it looked like Unix. He offered it on the Internet, for nothing. He provided the source code, so that people could get right into the system and fiddle about as much as they liked. (The technical term for this is open source.) They could use it, make improvements to it, write utilities for it, give away copies to other people; the main condition was that they had to make their improvements available to Linux. The idea caught on. Today there are tens of thousands of people all over the world happily working away at Linux, altering it, fixing bugs in it, writing applications and games for it. Many large organisations (such as NASA) use Linux. Companies in the computing industry such as IBM, Intel and Dell have announced support for it. More and more commercial software houses are adapting their applications to run under it. If you’ve ever surfed the Internet, you are practically bound to have landed on sites that have been created under Linux. Corporate vendors offer what are called distributions of it; the Linux part is still free, but the vendors offer technical support and work at enhancements to make the thing easier to install. 
Linux is written by programmers for programmers, but vendors present it in a form suitable for people who simply want to use it and don't want to have to know lots of stuff about computers. Torvalds remains head of the community and decides on what changes will be made. He earns a salary in his day-job but he makes nothing out of Linux, though he has deservedly become one of the great men of computing and receives lots of invitations to address multinational computer conferences.

7.5.2 Memory management 7.5.2.1 Memory management in Windows 98

Windows 98 is a single-user system. Its memory management does not have to be so complicated as the memory management of a multi-user system. Windows 98 offers multi-tasking and multi-threading with pre-emptive scheduling. Memory management is handled by the Virtual Machine Manager. The VMM creates and manages a single virtual machine (VM) that contains all Win32 applications (and the Win32 API subsystem). Furthermore, any DOS application is given a virtual machine. This in effect fools it into thinking it has control of the processor and is the only process in memory. Each virtual
machine has its own memory space and access to devices. More than one virtual machine can be running on one real machine. The VMM maintains one virtual machine for each DOS application and one for all the Windows applications. The memory model is shown in Figure 7.4.

Figure 7.4: Windows 98 memory model
- 2 GB process memory
- 1 GB for DLLs etc.
- 1 GB for system code

7.5.2.2 Memory management in Linux

Linux is designed to run on a variety of hardware platforms. It is, like Windows, a multi-tasking, multi-threading system but, unlike Windows, it is a multi-user system. Linux, like Windows, runs a virtual memory system, which offers:

- a large address space for each process (much larger than physical memory);
- fair allocation of real memory between processes;
- protection, so that all processes are protected from each other (and a crashing process won't affect the other processes or the operating system);
- shared virtual memory, so that two processes can use the same library (rather on the lines of Windows' DLLs).

Processes are divided into pages. Linux uses demand paging, so that only the pages that a process demands are brought into memory. When the system begins to run out of memory, Linux uses:

- swapping, so that pages that have been modified are swapped out to a swap file;
- discarding pages, so that pages that have not been modified and can easily be read back in from file are discarded.

Linux allocates pages according to the buddy system. (The size of the pages will vary according to the hardware, but is often about eight kilobytes.) Pages are allocated in blocks which are powers of 2 in size (1, 2, 4, etc). A process asks for a block of a certain size. The memory manager looks through the list of blocks of that size. If one is free, the process gets it. If not, the manager looks for blocks of the next size up. The process continues until all free memory has been searched or a space has been found. If the block allocated is larger than that needed, it’s broken down (in powers of two) until there is a block of the right size. For example, a process wants a block of 2 pages and there are no blocks of that size free; there are no blocks of size 4 free either; but there is a free block of size 8; that block is divided into two blocks of 4, one of which is divided again into two blocks of 2 and given to the process; the spare blocks of size 2 and 4 are added to the lists of free blocks.
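The worked example above (a request for 2 pages satisfied by splitting a free block of 8) can be traced with a toy free-list model. This sketch shows only the splitting side of the buddy system; a real allocator also tracks block addresses and coalesces freed buddies back into larger blocks:

```python
# free_lists maps block size (in pages, a power of two) -> number of free blocks.
def buddy_alloc(free_lists, want):
    size = want
    while size <= max(free_lists):
        if free_lists.get(size, 0) > 0:
            free_lists[size] -= 1            # take this block
            while size > want:               # split it down to the wanted size,
                size //= 2                   # returning each spare buddy
                free_lists[size] = free_lists.get(size, 0) + 1
            return want
        size *= 2                            # nothing free: try the next size up
    return None                              # no block large enough

free = {1: 0, 2: 0, 4: 0, 8: 1}
print(buddy_alloc(free, 2))   # 2
print(free)                   # {1: 0, 2: 1, 4: 1, 8: 0}: spares of 2 and 4 freed
```

The final free lists match the example in the text: the block of 8 yields the requested block of 2 plus spare blocks of sizes 4 and 2.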

7.5.2.3 Questions

Q10: Windows 98 is a multi-user system. a) True b) False

Q11: Windows 98 is designed to run on a single processor. a) True b) False

Q12: Linux is a multi-user system. a) True b) False

Q13: Linux can run on a PC. a) True b) False

Q14: Windows 98 uses paging. a) True b) False

Q15: Linux uses the Buddy system of memory management. a) True b) False

7.5.3 Processing modes Windows 98 allows multi-tasking and multi-threading: it can run a lot of tasks, and each task can run a lot of threads. In fact, Windows 98 regards a process as an instance of an application with at least one running thread. The scheduler controls all the running
threads. It uses pre-emptive scheduling when more than one thread demands the processor. Linux allows multi-tasking and multi-threading, and does so for multi-users. It can also run lots of processors, whereas Windows is designed for a single user and a single processor. Note: terms can become confusing. Multi-processing is normally taken to mean running lots of different processes rather than lots of different processors. In Linux, processes are put in what’s called a run queue. Each process is given a priority (by default, processes are equal and have the same priority). The processor handles jobs in order of priority. If several processes have the same priority, Linux uses round robin scheduling to give each of them time slices in turn. Because of its reliability, Linux is being used more and more to handle real-time processes. Such processes are given very high priority. They can be handled on a round robin basis, or simply as a queue, first come first served, depending on how the real-time system is meant to behave.
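Round robin scheduling among equal-priority processes can be sketched as a queue of (process, remaining time) pairs, each process taking one time slice per rotation. This is a simplified model, ignoring I/O blocking and the priority levels mentioned above:

```python
# Simulate round-robin scheduling: each runnable process gets one quantum in turn.
from collections import deque

def round_robin(bursts, quantum):
    """bursts: {name: remaining_time}; returns the order in which slices run."""
    queue = deque(bursts.items())
    order = []
    while queue:
        name, remaining = queue.popleft()
        order.append(name)                   # this process gets the processor
        remaining -= quantum
        if remaining > 0:
            queue.append((name, remaining))  # unfinished: back of the queue
    return order

print(round_robin({"A": 3, "B": 2, "C": 1}, 1))
# ['A', 'B', 'C', 'A', 'B', 'A']
```

Each process gets a fair share of processor time; a first-come-first-served queue, by contrast, would run A to completion before B started.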

7.5.3.1 Questions

Q16: Linux supports multi-threading.

a) True b) False

Q17: Windows 98 supports multi-threading.

a) True b) False

Q18: Windows 98 uses preemptive multi-tasking.

a) True b) False

Q19: Round robin scheduling is a kind of preemptive scheduling.

a) True b) False

7.5.4 Input and output In Windows 98, device drivers are DLLs. The main advantage of this is that the drivers common to running applications (such as the display driver) need only to be loaded into memory once and can be shared by the different applications. Windows 98 supports a range of plug and play devices. At present, you would hardly describe Linux as supporting plug and play, but they’re working on it. In Linux, device drivers are part of the kernel and, to install a new device driver, you rebuild the kernel to include it. Once the kernel has been rebuilt, you need to reboot the system. Windows and Linux both allow devices direct memory access.

Linux, like Unix, is designed to provide a uniform system for I/O operations. To this end, the kernel treats devices as files, albeit slightly special ones. Each device is represented by a special file that has a place in the directory structure (usually in a directory called dev). From the user’s or programmer’s point of view, this means that transferring data to or from a device is just like transferring data to or from a file: the system calls for opening, closing, reading and writing are just the same. In Linux, the device drivers are linked into the kernel. When a new device is added to the system, Linux detects this when the system is booted and links the appropriate device driver into the kernel. In Windows, device drivers are implemented as DLLs: new drivers can be added without the Windows code being affected. Some device drivers implement virtual devices (for example, to enable remote logins). Linux supports a wide range of terminals for a wide range of computers. Windows works with screens that can be run by a PC.
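The "devices are files" idea is easy to demonstrate: a device node such as /dev/urandom is opened and read with exactly the same calls as a regular file. (This snippet assumes a Linux or other Unix-like machine, where /dev/urandom exists.)

```python
# Read from a character device with ordinary file operations.
import os, stat

with open("/dev/urandom", "rb") as dev:
    data = dev.read(8)          # the same read() used for ordinary files
print(len(data))                # 8

mode = os.stat("/dev/urandom").st_mode
print(stat.S_ISCHR(mode))       # True: a character device, not a regular file
```

No special device API is needed: open, read, write and close work uniformly, which is precisely the uniformity of I/O the text describes.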

7.5.4.1 Questions

Q20: Device drivers in Linux are known as DLLs. a) True b) False

Q21: Windows 98 supports plug and play. a) True b) False

Q22: Linux supports plug and play. a) True b) False

7.5.5 Filing system Windows 98 uses the FAT32 system which allows a 4 KB cluster size on hard discs. This (compared, say, to the 32 KB clusters available in previous versions of Windows) is small and leads to less waste of space. The FAT32 system supports long filenames, with spaces in them. Files can be organised in a hierarchy of folders. The File Manager utility, which comes as part of Windows, makes it easy to move, copy and delete files and folders.

Linux also supports long filenames, but using spaces can lead to problems. (People tend to use the underscore instead.) Many users regard this as a nuisance.

In Linux, directory information is held, for the most part, in the files rather than in the directory. Each file has associated with it an information node (inode) which contains all the information about the file, such as who owns it and how big it is. A directory entry contains merely a filename and a pointer to the relevant inode. A file can have entries in many directories, each potentially with a different name, although all will refer to the same inode.
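The point that several directory entries can refer to one inode can be shown with a hard link; os.stat() reports the inode number and the link count. (Unix-only; the filenames here are just examples.)

```python
# Two names, one inode: create a file, hard-link it, compare stat results.
import os, tempfile

d = tempfile.mkdtemp()
original = os.path.join(d, "notes.txt")
alias = os.path.join(d, "same_file")

with open(original, "w") as f:
    f.write("hello")
os.link(original, alias)         # a second directory entry for the same inode

a, b = os.stat(original), os.stat(alias)
print(a.st_ino == b.st_ino)      # True: both names point at one inode
print(a.st_nlink)                # 2: the inode records how many names it has
```

Deleting one name merely removes a directory entry; the file itself survives until its last link goes.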

For file storage, Windows 98 uses clusters, which are multiples of a disc’s physical sector size. Linux works directly with single sectors, which leads to a more efficient use of disc space. A volume is a floppy disc drive or CD-ROM drive or a partition of a hard disc. In Windows, each volume is regarded as a separate drive (A: C: D: etc.). These drives are all called by letter (which can become a nuisance on a large network). In Linux, all volumes can be placed in the same directory tree structure. A new volume (a ROM drive, for example) can be mounted anywhere you like in the directory structure. This gives a smoother directory structure to work in.
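The space saving from smaller clusters is simple arithmetic: a file always occupies a whole number of clusters, so the last cluster is partly wasted. A quick illustration:

```python
# Disc space actually consumed by a file, given the allocation unit size.
def space_used(file_size, cluster_size):
    clusters = -(-file_size // cluster_size)      # ceiling division
    return clusters * cluster_size

size = 10_000                                      # a 10,000-byte file
print(space_used(size, 4 * 1024))                  # 12288: three 4 KB clusters
print(space_used(size, 32 * 1024))                 # 32768: one 32 KB cluster
```

On average a file wastes half its last allocation unit, so the smaller the unit, the less the slack - which is why 4 KB clusters beat 32 KB ones, and single sectors do better still.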

7.5.5.1 Questions

Q23: Linux maintains a unified directory structure. a) True b) False Q24: You can have spaces in Linux filenames. a) True b) False Q25: Windows always stores files in single sectors on disc. a) True b) False

7.5.6 Resource sharing Windows 98 is a single-user system and cannot be used to serve a multi-user network. It can be used by the machines on a peer-to-peer network. Windows 98 supports a selection of network protocols: TCP/IP, so that a Windows machine can work easily with a Unix host or with the Internet; IPX/SPX, so that the computer can stand as a client on a Novell network; and NetBEUI. The file system and devices can be made accessible to other Windows systems, protected by passwords. Windows offers software for connection to the Internet. Its Internet Explorer is said to be part of the operating system. It has been a matter of debate whether you can argue that a browser should be part of an operating system, or whether Microsoft were simply contriving to exclude the opposition. You certainly can't get rid of Internet Explorer altogether (experts have tried). You can download Netscape Navigator (and other browsers) for nothing. Linux is designed as a multi-user system and is therefore built to run networks and allow the sharing of resources and, moreover, the supervision of such sharing. The network manager can maintain a single list of all system users and put users into groups; members of a group have the same access to system resources. It allows files to be shared, so that a user can create files and make them available to other users of the network; on the other hand, the user who owns a file or the system manager can restrict access, either to everyone or merely to certain groups or users. The system manager
can also hide certain files completely, from everybody or from certain groups.

7.5.6.1 Questions

Q26: Windows 98 permits peer-to-peer networking. a) True b) False

Q27: In Windows 98, resources like printers can be protected by password. a) True b) False

7.5.7 User interface Windows 98 comes with a GUI that provides a full WIMP environment. This makes it easy for the user to do relatively simple things, such as load an application. A GUI is essentially interactive. (The drawback is that it's not designed to be programmable.) It is possible to get out into a DOS-like command environment. It's easy to get started in Windows. On the other hand, doing advanced things can be very complicated and exasperating: either you need to do lots of different things in lots of different windows or you need to get in and edit things like the registry (which handles configuration) by hand, just as you would in a CLI. Each user of Windows 98 on a given machine has what is called a user profile. The profile includes settings for the desktop, email, and so on. However, it needs to be set on every machine the user wants to use. They're working on the user-friendliness, but at the moment it should be said that Linux is not ideal for complete beginners. On the other hand, it is much preferred to Windows by people who want to do advanced things. Linux basically provides a command line interface. It offers a selection of what are called shells. (A shell surrounds a kernel.) Each shell has its own set of commands. Each command will have a set of options or switches that can be used to shape its behaviour. Commands can be typed in directly. The shell acts as a command interpreter, executing the command if possible, sending an error message if not. A series of commands can be brought together to make a shell program or script. A script can be a straightforward sequence of commands (to carry out, say, a sequence of actions that you often want the computer to perform) but it can be a lot more than that. The scripting language is pretty much a programming language. It provides high-level programming structures such as selection, repetition and modularity. You can use variables, which can be assigned values, and parameters.
You can ask for input from the user at run time and process it. The shells and the scripting language offer the user precise control of the operating system. However, not everyone particularly wants to learn commands and switches and all the rest of it. So Linux also offers a selection of GUIs (rather than just one).
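A flavour of the scripting language: variables, selection (if) and repetition (for) in a few lines of Bourne-style shell. This is an illustrative script, not tied to any particular shell's extensions:

```shell
#!/bin/sh
# Sum the values greater than 2 from a small list.
total=0
for n in 1 2 3 4 5; do               # repetition: runs once per list item
    if [ "$n" -gt 2 ]; then          # selection: keep only values above 2
        total=$((total + n))
    fi
done
echo "total=$total"                  # prints total=12
```

Because the same commands can also be typed interactively, a script is really just an automated version of what the user would otherwise do by hand at the prompt.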

7.5.7.1 Questions

Q28: Shell is a name for a command interpreter.

a) True b) False

Q29: Windows 98 offers a full WIMP environment.

a) True b) False

7.5.8 Applications support The great strength of Windows comes from its long dominance of the market. By default, applications for PCs are written to run under Windows. Microsoft itself has written many good applications, such as the word processor Word and the spreadsheet Excel. If you want software for a Windows machine, there’s an awful lot to choose from. Applications for Windows make use of the API (application programming interface). This means that applications for Windows tend to look much the same and to work in much the same way. The menus and dialog boxes come from the API rather than from the application developer. The application can also use the API to look after the details when it comes to using devices. As yet, there is nothing like the same range of software available to run under Linux. Wordperfect runs under it; so does Netscape Navigator, and so on. But, if you’re moving from a Windows system, you can’t assume that equivalent packages to the ones you’ve been used to will be available.

7.5.8.1 Questions

Q30: Windows 98 supports only Microsoft applications.

a) True b) False

Q31: Applications for Linux have to work in text mode.

a) True b) False

7.5.9 Security Perhaps the basic threat to a computer system is a hardware failure. No operating system can protect you against that. What they can do is make restoration possible, so that, once the hardware fault has been fixed, the system can be put back to normal. All networks are vulnerable, especially in respect of a failure of one of the servers. This is true even on a peer-to-peer network if machines on it are being used as the servers to printers or backing store. If a workstation goes down, the performance of other machines on the network should not be adversely affected.

Windows 98 is designed for a single user, even though most computers these days at places of work or in educational establishments are networked. Passwords are used to prevent access to shared resources on the network, but there's nothing to prevent someone using a Windows box as a stand-alone machine. Windows offers further password protection for shared resources on a network. It allows users to give files a certain amount of protection (by making them read-only or hidden), but such protection is only of any use against accidental damage. Another user can always alter the protection on a file. Linux is a multi-user system and takes security much more seriously. All users have IDs and passwords, and a valid combination has to be entered before you can do anything at all. Some systems even check that your suggested password is obscure enough. The system logs password attempts and can be set to permit only a limited number of attempts, as a defence against hackers. Passwords are saved in encrypted form. When a password is entered, it is encrypted, and the encryptions are compared. Apart from the system manager (often known as a superuser or root) there are three sorts of people on a Linux system: the user, the group that the user belongs to, and everyone else. Users are grouped by the manager. As a rule, the groupings are fairly obvious (for example, students grouped in classes, staff grouped according to subject or department and whether teaching or non-teaching, and so on). Such groups can also be used in e-mail. Each of these sorts (user, group, everybody) can be afforded permissions to files. The permissions are:

- r - read;
- w - write;
- x - execute; if the file is executable, this allows it to be run.

Permissions to a file can be changed only by the owner (the user) or a superuser. This gives you full control over your files (unless you have exceptionally meddlesome superusers to put up with): you can keep them to yourself, you can make them available to other people, as you think best. The permissions to a file might look like this:

User: rwx (read, write, execute)
Group: r-x (read, execute)
Everybody: r-- (read)

In this case, the owner can do whatever he likes with the file; anyone in the same group can read the file and run it but not alter it; people outside the group can only read it. Linux has full backup facilities.
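The mapping from a symbolic permission string like the one above to the octal modes used by the Unix chmod command can be sketched in Python. The `parse_mode` helper below is illustrative, not part of any standard library:

```python
def parse_mode(symbolic):
    """Convert a 9-character permission string such as 'rwxr-xr--'
    (user, group, everybody) into the octal mode used by chmod."""
    if len(symbolic) != 9:
        raise ValueError("expected 9 characters, e.g. 'rwxr-xr--'")
    mode = 0
    for ch, expected in zip(symbolic, "rwxrwxrwx"):
        mode <<= 1                 # one bit per permission slot
        if ch == expected:
            mode |= 1              # permission granted
        elif ch != "-":
            raise ValueError(f"unexpected character {ch!r}")
    return mode

print(oct(parse_mode("rwxr-xr--")))  # the example above is mode 0o754
```

The octal digits correspond directly to the three sorts of people: 7 for the user, 5 for the group, 4 for everybody.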


7.5.9.1 Questions

Q32: Windows 98 has no backup facilities.

a) True b) False

Q33: Linux uses permissions to help protect files.

a) True b) False

Q34: In Windows 98, anyone can alter the access rights to a file.

a) True b) False

Q35: In Linux, only the owner or a superuser can alter the permissions to a file.

a) True b) False

7.5.10 Comparing Windows 98 and Linux

Range of platforms
The range of computers that Windows can run on is extremely limited: Windows runs only on PCs with (single) Intel processors. Linux runs on a wide range of machines, including ones with many processors.

Cost
The cost of an operating system on a PC is usually hidden as part of the purchase price. Windows 98 on its own costs about 100 pounds. Linux is free, though a distribution version can cost about 50 pounds.

Ease of installation
Windows is easy to install because it's designed to run on one kind of chip. Linux, because it's designed to be flexible, is a little more difficult. On the other hand, you can get free online technical support for Linux (from enthusiasts), which is not available for Windows.

Reliability
It's not easy to get statistics about the reliability of operating systems. Often, estimates of reliability come down to hearsay and anecdote. For example, when the writer of these notes sat down to write them, his Windows laptop froze as he typed in the first character of his password. There's no telling what the trouble was, but it seemed that he hit the key at exactly the moment when the automatic virus-scanner was terminating. The keyboard became inoperative: not even Ctrl-Alt-Del could get through. There was nothing for it but to switch off, start again, and receive messages about not having shut down the computer properly. Windows hangs on the laptop two or three times a day (though this might, at least in part, be something to do with the hardware). Linux, which he uses whenever he can,


has never yet hung on him. It can happen - it's called a kernel panic - but never in his experience.

No software of any complexity will be free of bugs. In the case of operating systems, users and even applications are unpredictable, and there will be various and barely foreseeable conjunctions of processes, interrupts, memory mappings, and I/O operations that the system just can't cope with. The coders at an organisation like Microsoft are always under pressure: from deadlines, from the need to keep up to date and even produce novelty, and from various other corporate imperatives. Linux goes at its own pace. One reason for Linux's reliability is that there are all these fanatics poring over the code every day. If anyone spots a bug, people race each other to fix it first. Most bug fixes are accomplished in a day or two: you don't have to wait months (and then pay money) for the next version.

Speed
Besides being reliable, Linux runs very fast. Windows is slow by comparison.

7.5.11 Summarising two operating systems

Comparing and contrasting (15 min)

When you're asked to compare two things, you're expected to say in what respects they're alike. When you're asked to contrast them, you're expected to say how they differ. Copy the following table onto paper and fill it in, noting the similarities and differences between Linux and Windows 98.

                        Linux          Windows 98
Memory management
Processing modes
Input / Output
Filing system
Resource sharing
User interface
Applications support
Security

Reporting on 2 operating systems

For this activity, you will write a report comparing 2 operating systems that you have actually used. This report is required as evidence for Outcome 2 of this unit.

Stage 1: Select 2 operating systems

¯ Choose 2 operating systems to which you have access, for example, Windows XP and Mac OS X, or Windows 2000 and Linux, or MS-DOS and Unix. At least one of them must be a current operating system.


Stage 2: User Interface

¯ Obtain a screen dump of a typical view of each operating system

¯ Annotate these screen dumps showing the major features of the 2 systems

¯ Write a paragraph outlining the main differences and similarities between the two user interfaces

Stage 3: Technical Requirements

¯ Research the main technical requirements for each of your chosen operating systems, including

– Processor type – Minimum clock speed – RAM required – Hard disc space required

Stage 4: Underlying technology

¯ Research how the two systems implement any three of the following

– Memory management – File management – Process management (scheduling) – Input/output – File and folder access rights – Backup facilities

¯ Write a short paragraph describing how your chosen operating systems implement these aspects

Stage 5: Submit the report to your tutor for checking.

7.6 Summary

¯ Operating Systems have evolved from CLI-based systems to GUI designs, leading to

– Greater user convenience – Increasing software complexity – Increasing demands on system resources – Increasing burdens on system performance


¯ Operating Systems provide increasing services to applications, leading to

– reduced development time for applications – a standard look and feel for applications – improving capability for applications to communicate with each other – improving capability for document embedding

¯ Operating Systems provide access control in single user and multi-user systems, using

– Log-in and password systems – File and folder attributes

¯ Most Operating Systems provide built in backup facilities

7.7 Revision Questions

Q36: A command line interface (CLI)

a) uses icons to represent files, folders and operations
b) requires the user to type in all commands
c) is a speech-driven interface to an operating system
d) makes more demands on system resources than a GUI

Q37: The Linux operating system

a) has a graphical user interface
b) is a version of Unix for PCs
c) is an open source operating system with a CLI
d) can run on a PC but not on a Mac

Q38: DDE

a) is an operating system
b) is a method of allocating processor time in a multi-tasking environment
c) is a method of memory management
d) is a method of allowing applications to exchange data

Q39: File attributes

a) allow access to files to be controlled in a multi-user system
b) are only effective in multi-user systems
c) are only used in CLI-based operating systems
d) are no longer used in modern operating systems


Glossary Arithmetic and Logic Unit A part of the processor where data is processed and manipulated, including arithmetical operations and logical comparisons.

Assembler A program which converts instructions written in assembly language (mnemonics like LDA #25) into machine code binary instructions.

Backing Storage A means of storing large amounts of data permanently outside the computer system. For example: hard disc, floppy disc, optical disc (CD or DVD), memory stick.

Binary number system A system in which numbers are represented using only the digits 0 and 1.

Bit A single unit of binary data.

Boolean Two-state algebra developed by George Boole.

Bootstrap program A program that loads and initialises the operating system on a computer.

Byte Eight units of binary data (bits).

Cache Memory Fast access storage located between the processor and main memory.

Command-line interfaces (CLIs) Method of interacting with a system by typing keywords using a keyboard. All commands and responses are in the form of text.

Complex Instruction Set Computer Processor families that follow the philosophy of increasingly complex instruction sets and addressing modes.


Control Unit A part of a processor which manages the execution of instructions by fetching, decoding, then sending out control signals to other parts of the processor or computer.


Data dependency A condition where an instruction cannot be executed because it requires data produced by an earlier instruction which is still being processed in a pipeline.

Device drivers System software that handles a particular device (e.g. a printer driver).

Directive Directives are instructions that control the operation of the assembler program. That is, they provide information to enable the assembler to store the machine code properly prior to its execution.

Fetch-Execute Cycle The repeated process of fetching instructions from main memory, decoding the instructions and executing them until an instruction to HALT is encountered.

Graphical user interfaces (GUIs) Method of interacting with a system by using icons to represent files, folders and processes. Commands are entered by using a pointing device such as a mouse.

Hardware The physical components of a computer system.

Hexadecimal A number system with a base of sixteen.

Instruction pipelining The idea that the concept of an assembly line can be applied to the fetch-execute cycle of a processor.

Instruction Register A register which holds the instruction currently being decoded or executed.

Instruction Set The list of all possible machine code instructions (op-codes) for a given processor.


Internal bus The medium for communicating data and control signals between the component parts of the processor.

Kernel The core of an operating system, which remains in memory at all times, and provides basic services to the rest of the operating system.

Logic gates Devices that perform a logical operation upon their input signals to produce their output signals.


Memory Address Register A register which holds an address about to be sent along the address bus.

Memory Data Register A register which holds data which has just arrived along the data bus, or is about to be sent along the data bus. Also known as memory buffer register.

Microprogramming An approach where each complex machine code instruction is split into a series of simpler instructions called microinstructions.

Mnemonic An easily remembered code (often 2 or 3 letters such as TAX, LDA) which represents a machine code instruction. Mnemonics are used in assembly languages.

Multi-tasking Where several applications are open at the same time, and the user can switch easily between them.

Multi-threading One process can have several threads. The threads act independently of one another, but share the same memory space.

Opcode Part of a machine code instruction which defines the operation to be carried out.

Operand Part of a machine code instruction which defines the data (or address of the data) to be used.

Pre-emptive scheduling Allocation of processor time to processes under the control of the operating system.

Process A program in execution.

Program Counter A register which holds the address of the next instruction in a program.

Reduced Instruction Set Computer A processor whose design is based on the rapid execution of a sequence of simple instructions rather than on the provision of a large variety of complex instructions.

Registers Locations within the processor where individual items of data can be stored temporarily.


SIMD (Single Instruction Multiple Data) A microprocessor architecture allowing a single instruction to be applied to several items of data (held in registers) simultaneously.

Software The programs that run on hardware.

Spooling The temporary storage of input or output data on magnetic disk or tape. It is the preferred method when a peripheral is shared across a network or when large data files are being transferred.

Stack Pointer A register which holds the address of the next free location in a stack.

Stored program concept The idea that the sequence of instructions to solve a problem should be stored in the same memory as the data.

Superscalar processor A processor with multiple execution units to improve performance.

System bus The medium for communicating data and control signals between the main components of the computer.

TCP/IP Transmission Control Protocol/Internet Protocol (TCP/IP) is the basic communication protocol of the Internet. It can also be used as a communications protocol in a private network.

Von Neumann architecture An architecture based upon the basic idea that the sequence of instructions to solve a problem should be stored in the same memory as the data.


Answers to questions and activities 1 Computer Structure Revision (page 2)

Q1: b) all types of data and instructions within a computer system

Q2: a) hard disk

Q3: c) within the processor of all types of computer

Q4: c) the data bus

Q5: d) palmtop, laptop, supercomputer

Constructing a comparison table of storage devices (page 18)

Q6: decreases

Q7: increases

Q8: increases

Q9: decreases

Answers from page 20.

Q10: Register, Cache, RAM, Hard Disc, CD, DVD, Tape

Q11: No - the order would be the same.

Answers from page 21.

Q12: Because it is easy and cheap to manufacture two-state devices.

Q13: As a 7 or 8-bit pattern of 0s and 1s.

Answers from page 21.

Q14: 9


Q15: 6

Q16: 13

Q17: 0101

Q18: 01010111

Q19: 0.11

Q20: 15

Q21: The solution to the last question was 1111 in binary, which is 15 in decimal. Adding 1 gives 16, or 2^4; hence the largest number in 4 bits is 2^4 - 1. Generalising this to n bits: the largest number is 1111...11; adding 1 gives 1000...000, which is 2^n; hence the largest value that can be stored in n bits is 2^n - 1.
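The reasoning in this answer can be checked directly in Python:

```python
# The largest unsigned number that fits in n bits is 2**n - 1,
# since the all-ones pattern plus 1 is a single 1 followed by n zeros.
for n in (4, 8, 16):
    print(n, bin(2**n - 1), 2**n - 1)
```

For n = 4 this prints the binary pattern 0b1111 and the value 15, matching the worked answer.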

Answers from page 23.

Q22: 13

Q23: 53

Q24: 10011

Q25: 1010001

Answers from page 24.

Q26: EB

Q27: 5F

Q28: 42

Q29: 205
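Conversions like the ones in these answers can be verified with Python's built-in functions. The particular values below are illustrative and not tied to the original exercise questions:

```python
# Number-base conversions with Python built-ins.
print(int("1101", 2))       # binary 1101 -> decimal 13
print(bin(19))              # decimal 19 -> binary 0b10011
print(int("EB", 16))        # hex EB -> decimal 235
print(format(205, "X"))     # decimal 205 -> hex CD
```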

Answers from page 25.

Q30: a) a group of wires which carry data between the processor, memory and peripheral devices

Q31: c) between registers and RAM

Q32: d) data transfer rate increases

Q33: c) 0011011

Q34: b) 2C

Q35: a) 1111 in binary, which is 15 in decimal

Q36: a) a base 16 code used to write binary numbers in a shorthand way


2 Processor Structure Revision (page 28)

Q1: b) data storage locations within a processor

Q2: c) the control unit

Q3: c) to send addresses from processor to memory

Q4: d) the memory data register (MDR)

Q5: d) the ALU, control unit and registers are parts of the processor

Answers from page 29.

Q6: Control Unit

Q7: Arithmetic and logic unit

Q8: Registers

Q9: Internal Processor Bus

Answers from page 30.

Q10: Instruction Register

Q11: Program Counter

Q12: General Purpose Register

Q13: Memory Data Register

Q14: Memory Address Register

Answers from page 33.

Q15: Fetch instruction from memory, decode the instruction, increment the program counter and then execute the instruction.

Q16: At the fetch stage, the contents of the program counter are placed in the MAR and the instruction is fetched via the MDR into the instruction register. At the decode stage, any operands of the instruction that reside in memory must be fetched, hence again the MAR and MDR are used. At the execution stage it may be necessary to place a result back in memory, so again the MAR and MDR will be used.

Q17: The contents of the next memory location would be loaded into the Instruction Register and an attempt would be made to interpret it as an instruction even though this word may have no connection with this program at all or may be part of its data. If interpretation as an instruction is possible then the instruction is executed and the next word fetched as an instruction. This sequence of events would proceed until the word


fetched could not be interpreted as an instruction or possibly when an attempt was made to access an illegal memory location. The operating system would then terminate the program.
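The runaway behaviour described in this answer can be sketched with a toy fetch-execute loop. The three op-codes and the memory layout below are invented for illustration; the program has no HALT, so the processor runs past the last instruction and tries to decode a data word:

```python
# A toy machine: memory holds a 3-instruction program followed by data.
memory = [("LDA", 4), ("ADD", 1), ("STA", 5), 42, 7, 0]

pc, acc = 0, 0
while True:
    instruction = memory[pc]              # fetch
    pc += 1                               # increment program counter
    if not isinstance(instruction, tuple):
        # A data word reached the instruction register: it cannot be
        # decoded, so the operating system would terminate the program.
        print(f"cannot decode word {instruction!r}; program terminated")
        break
    op, operand = instruction             # decode
    if op == "LDA":
        acc = memory[operand]             # execute: load accumulator
    elif op == "ADD":
        acc += operand                    # execute: add operand
    elif op == "STA":
        memory[operand] = acc             # execute: store accumulator
```

After LDA, ADD and STA complete, the loop fetches the data word 42, fails to decode it, and stops, just as the answer describes.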

Answers from page 42.

Q18: a) holds the address of a memory location about to be accessed

Q19: a) fetch instruction -> increment PC -> decode instruction -> execute

Q20: b) widening the address bus

Q21: a) 64 bits

Q22: c) is used to send data to and from peripheral devices

Q23: c) L1 cache is faster to access than L2 cache

Q24: c) which requires successive instructions to be stored on different memory chips

Q25: c) direct memory access


3 Processor Function Revision (page 47)

Q1: b) the low level language specific to a particular microprocessor

Q2: c) program counter (PC)

Q3: a) fetch instruction -> increment PC -> decode instruction -> execute instruction

Q4: d) the memory data register (MDR)

Q5: b) the data bus is 8 bits wide, but the address bus may be more than 8 bits wide

Answers from page 51.

Q6: Divide by 2.

Matching 6502 registers (page 55)

Q7:

Register Name            Abbreviation  Bits  Purpose
Accumulator              A             8     General purpose register - holds result of any calculation
Index Registers          X and Y       8     General purpose registers, used for loop counters
Memory Data Register     MDR           8     Holds data arriving from the data bus, or about to be sent on the data bus
Memory Address Register  MAR           16    Holds address to be sent out on the address bus
Program Counter          PC            16    Holds address of next instruction to be fetched
Instruction Register     IR            16    Holds instruction currently being decoded or executed
Status Register          SR            8     Holds 7 individual flag bits
Stack Pointer            SP            8     Holds address of first empty location in the 256 byte stack

Answers from page 56.

Q8: a) a

Q9: c) c

Q10: d) d

Q11: b) b


Answers from page 92.

Q12: a) 10010110 11001010

Q13: b) LDA 24A7

Q14: c) ADC is the op-code and #56 is the operand

Q15: a) a data transfer instruction

Q16: c) a shift instruction

Q17: d) direct addressing

Q18: d) LDA 2345

Q19: c) a comment


4 Processor Performance Revision (page 97)

Q1: b) an op-code followed by one or more operands

Q2: c) assembly language is a human representation of binary machine codes

Q3: c) a list of all its machine code instructions and addressing modes

Q4: d) direct addressing

Q5: d) TXA

Answers from page 101.

Q6: Backward compatibility is maintained in a processor family by ensuring that features (for example new instructions, new registers, etc.) are only ever added to the design and that no existing feature is ever removed.

Q7: 1, 4, 6

Answers from page 108.

Q8:

1. instructions of different lengths
2. data dependency
3. branch instructions

Answers from page 112.

Q9: b) have a smaller, but more efficiently executed, instruction set, than CISC processors

Q10: c) applies the same instruction to many items of data simultaneously

Q11: b) involves a processor dealing with several instructions simultaneously in stages

Q12: a) it may result in a jump to an instruction which is not next in the pipeline

Q13: c) branch prediction

Q14: c) involves a processor being able to operate two or more pipelines simultaneously


5 Processor Development Revision (page 117)

Q1: b) has more addressing modes than a RISC-based microprocessor

Q2: c) a superscalar processor

Q3: c) addressing modes

Q4: d) it splits instructions into stages which allows more than one instruction to be processed at the same time

Answers from page 128.

Q5: 8086 10 MHz

80286 16 MHz

80386 25 MHz

80486 100 MHz

Pentium 166 MHz

Pentium Pro 200 MHz

Pentium II 600 MHz

Pentium III 733 MHz

Q6: 6-stage pipeline introduced in 80386 chip

Q7: (a) 8 registers (b) 8 registers (c) 16 registers (d) 24 registers

Q8: The Pentium was the first superscalar X86 chip.

Q9: CISC because - large instruction set, few registers, many addressing modes

Answers from page 133.

Q10: Apple, Motorola, IBM

Q11: the 601 in 1993

Q12: it has 3 independent processing units - the floating point unit (FPU), the integer ALU, and the system unit

Q13: 2 sets of 32 registers, each 64 bits wide


Q14:

a) similar - the X86 has 235 different instructions, the PowerPC has 225
b) the X86 has varied instruction lengths (1-11 bytes); the PowerPC instructions are all exactly 4 bytes
c) the X86 has 11 addressing modes, the PowerPC has only 2

Q15: L2 "backside" cache on chip

Q16: d) G4

Q17: because the Mac uses the more efficient RISC architecture, a Mac with a lower clock speed may outperform a Windows PC with a higher clock speed

Q18: IBM servers, Nintendo Game Cube, and a range of embedded applications

Answers from page 137.

Q19: VLIW = very large instruction word; the IA-64 fetches a 128 bit bundle containing 3 41-bit instructions during each memory fetch

Q20: yes, it has 11 execution units which can operate in parallel

Q21: branch prediction means "guessing" whether or not a branch will be taken, and executing the following instructions accordingly - if the prediction is wrong, the pipeline will stall; predication means executing instructions from both branches simultaneously, and discarding the results from the branch which is not required

Q22: due to its parallel execution units, 10 stage pipeline, VLIW memory accessing and use of predication and speculative loading, the Itanium can process up to 20 operations per cycle.

Answers from page 144.

Q23: b) an X86 series processor

Q24: c) a PowerPC processor

Q25: b) in a 128 bit bundle containing 3 instructions and a pointer

Q26: a) = 1024 x 1024 x 1024 Gigabytes

Q27: c) symmetric multiprocessing, a system where several processors of equal status cooperate in a parallel computer system


6 Operating System Functions Revision (page 149)

Q1: b) MS Office XP

Q2: c) file management

Q3: c) ensuring that a program does not overwrite the OS when loaded into RAM

Q4: d) input/output management system

Answers from page 161.

Q5: A multi-user system may have many users accessing the system at the same time, e.g. a network server. A multi-tasking system may only have a single user, who is running several programs simultaneously, e.g. using a word processor, while an e-mail client is running. A single application may be treated by the processor as a number of different threads, e.g. the user interface thread continues, while a printing thread is working in the background.
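The single-process, multiple-thread situation in this answer (a user-interface thread continuing while a background thread works) can be sketched with Python's threading module. The job names and the shared `results` list are hypothetical, standing in for the shared memory space:

```python
import threading

results = []  # shared memory: visible to every thread in the process

def background_job(name, n):
    # Runs in its own thread but appends to the shared list.
    results.append((name, sum(range(n))))

worker = threading.Thread(target=background_job, args=("printing", 100))
worker.start()                  # background thread runs alongside main
results.append(("ui", 0))       # the main (user interface) thread continues
worker.join()                   # wait for the background thread to finish
print(results)
```

Both threads write to the same list because they share one memory space; the order of the two entries depends on scheduling.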

Q6: Process loading is where a program is copied from backing storage into RAM in order to be available to the processor. This happens, for example, when a user selects an application program by double-clicking on its icon.

Q7: Both the operating system and the application program must be resident in RAM.

Q8: Available memory is divided up into blocks of a fixed size (e.g. 200K). Processes are each allocated to a different partition, to ensure that they do not corrupt each other. Fixed partitioning is inefficient because a process may be rejected because it is too big for a single partition (although there is enough total memory space available), and because there will be unused space in each partition, which if grouped together could provide enough space for another process.

Q9: Each process is allocated a partition of an appropriate size. This prevents wasted space.

Q10: Best fit tends to leave lots of small unused areas of memory, which need to be compacted before they can be used.

Q11: Paging involves splitting large programs up into chunks of the same size, called pages. When a large program is to be loaded, the pages can be allocated to different partitions. This allows a program to be loaded even if there is not a single empty area of memory large enough for it. If the whole program is too large for the memory available, then only the pages required immediately need to be loaded. These can be swapped for other pages when the need arises.


Q12:

Fitting Algorithm  Description
First Fit          New process is allocated to the first big enough space in memory
Best Fit           New process is allocated to the smallest space into which it will fit
Worst Fit          New process is allocated to the largest space available
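The three fitting algorithms can be sketched as functions over a free list of block sizes. The free list and the sizes below are hypothetical, and real allocators track addresses as well as sizes:

```python
def first_fit(free, size):
    """First free block large enough for the process."""
    return next((b for b in free if b >= size), None)

def best_fit(free, size):
    """Smallest free block into which the process will fit."""
    return min((b for b in free if b >= size), default=None)

def worst_fit(free, size):
    """Largest free block available (that fits)."""
    return max((b for b in free if b >= size), default=None)

free_blocks = [300, 120, 500, 200]           # free space, in KB
print(first_fit(free_blocks, 150))           # 300: first big enough
print(best_fit(free_blocks, 150))            # 200: smallest that fits
print(worst_fit(free_blocks, 150))           # 500: largest available
```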

Answers from page 166.

Q13: Ready, blocked, running

Q14: b) It’s waiting for some I/O to finish

Q15: a) Ready

Q16: In cooperative scheduling, processes are supposed to give up the processor voluntarily to take turns. In pre-emptive scheduling, the operating system retains control and forces the processes to take turns.

Q17: a) True

Q18: b) False

Answers from page 170.

Q19: b) False

Q20: b) False

Q21: b) A device controller

Answers from page 173.

Q22: C:\B\Xml\Os\os1.xml

Answers from page 178.

Q23:

a) 4 and 5 b) 4 and 5 c) 23 and 24


Answers from page 181.

Q24: letter.doc is stored in clusters 4, 6, 12 and 13; expenses.xls is stored in 8; fun.exe is stored in 5, 1 and 2; note.txt is stored in 11, 9 and 10. Clusters 3 and 7 are unused, and the FAT is stored in cluster 0.
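Following a cluster chain like the ones in this answer can be sketched as a lookup loop. The FAT dictionary below mirrors the answer's hypothetical disc, with None marking the end of each file:

```python
# FAT-style table: each cluster maps to the next cluster of its file.
fat = {4: 6, 6: 12, 12: 13, 13: None,   # letter.doc
       8: None,                         # expenses.xls
       5: 1, 1: 2, 2: None,             # fun.exe
       11: 9, 9: 10, 10: None}          # note.txt

def clusters_of(start):
    """Follow the chain from a file's starting cluster to the end marker."""
    chain, cluster = [], start
    while cluster is not None:
        chain.append(cluster)
        cluster = fat[cluster]
    return chain

print(clusters_of(4))    # letter.doc: [4, 6, 12, 13]
print(clusters_of(5))    # fun.exe: [5, 1, 2]
```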

Answers from page 183.

Q25:

¯ Application requests FMS to load the file

¯ FMS checks the hard disc directory to locate the file, and check its size

¯ FMS sends message to MMS to allocate a block of RAM for the file

¯ MMS sends message to FMS to say sufficient RAM has been allocated

¯ FMS instructs the hard disc device driver to read the file into RAM

¯ Driver informs FMS that file has been read successfully

¯ FMS informs application that file has been loaded

¯ Application displays file on screen
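The message sequence above can be replayed as data. The (sender, receiver, message) tuple format is an invention for illustration; the component names follow the answer's abbreviations (FMS for file management system, MMS for memory management system):

```python
# The file-loading message sequence, step by step.
sequence = [
    ("Application", "FMS", "load file"),
    ("FMS", "Directory", "locate file, check size"),
    ("FMS", "MMS", "allocate RAM for the file"),
    ("MMS", "FMS", "sufficient RAM allocated"),
    ("FMS", "Driver", "read file into RAM"),
    ("Driver", "FMS", "file read successfully"),
    ("FMS", "Application", "file loaded"),
    ("Application", "Screen", "display file"),
]
for sender, receiver, message in sequence:
    print(f"{sender} -> {receiver}: {message}")
```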

Answers from page 185.

Q26: b) virus checking

Q27: c) worst fit

Q28: b) gives each process equal times slices in order

Q29: a) is the smallest area of a storage device that can be allocated to a file

Q30: c) a file can only be stored in a series of neighbouring clusters


7 Operating System Developments Revision (page 189)

Q1: c) MS Word

Q2: b) a command line interface

Q3: b) graphical user interfaces

Q4: d) is a computer that can be used by many users simultaneously

Answers from page 193.

Q5: b) False

Q6: Windows, Icons, Menus, Pointer

Q7:

¯ greater direct control of the operating system

¯ less demand on processor and memory

¯ can be programmed (using scripting languages)

Answers from page 198.

Q8: Object linking and embedding

Q9: An embedded object becomes part of the document and is stored in the same file; a linked object remains independent.

Answers from page 203.

Q10: b) False

Q11: a) True

Q12: a) True

Q13: a) True

Q14: a) True

Q15: a) True

Answers from page 204.

Q16: a) True

Q17: a) True

Q18: a) True

Q19: a) True


Answers from page 205.

Q20: b) False

Q21: a) True

Q22: b) False

Answers from page 206.

Q23: a) True

Q24: b) False

Q25: b) False

Answers from page 207.

Q26: a) True

Q27: a) True

Answers from page 208.

Q28: a) True

Q29: a) True

Answers from page 208.

Q30: b) False

Q31: b) False

Answers from page 210.

Q32: b) False

Q33: a) True

Q34: a) True

Q35: a) True

Answers from page 213.

Q36: b) requires the user to type in all commands


Q37: c) is an open source operating system with a CLI

Q38: d) is a method of allowing applications to exchange data

Q39: a) allow access to files to be controlled in a multi-user system
