Computer Architecture

Faculty Of And Technology

Second Term 2019-2020

Dr. Khaled Kh. Sharaf

Chapter 2:

Computer Evolution and Performance

LEARNING OBJECTIVES

1. A Brief History of Computers
2. The Evolution of the x86 Architecture
3. Embedded Systems and the ARM
4. Performance Assessment

1. A BRIEF HISTORY OF COMPUTERS

The First Generation: Vacuum Tubes

Electronic Numerical Integrator And Computer (ENIAC)
• Designed and constructed at the University of Pennsylvania; it was the world's first general-purpose electronic digital computer
• Started in 1943 and finished in 1946

• Decimal (not binary)
• 20 accumulators of 10 digits
• Programmed manually by switches
• 18,000 vacuum tubes
• 30 tons
• 15,000 square feet
• 140 kW power consumption
• 5,000 additions per second


The First Generation: Vacuum Tubes

VON NEUMANN MACHINE

• Stored program concept

• Main memory storing programs and data

• ALU operating on binary data

• Control unit interpreting instructions from memory and executing

• Input and output equipment operated by control unit

• Princeton Institute for Advanced Studies

• IAS

• Completed 1952

Structure of the von Neumann Machine

Structure of the IAS Computer

IAS - details

• 1,000 × 40-bit words

• Binary numbers

• 2 × 20-bit instructions per word

Set of registers (storage in CPU), part 1:
• Memory buffer register (MBR): contains a word to be stored in memory or sent to the I/O unit, or is used to receive a word from memory or from the I/O unit.
• Memory address register (MAR): specifies the address in memory of the word to be written from or read into the MBR.
• Instruction register (IR): contains the 8-bit opcode of the instruction being executed.

IAS - details

Set of registers (storage in CPU), part 2:
• Instruction buffer register (IBR): employed to hold temporarily the right-hand instruction from a word in memory.
• Program counter (PC): contains the address of the next instruction pair to be fetched from memory.
• Accumulator (AC) and multiplier quotient (MQ): employed to hold temporarily operands and results of ALU operations.
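As an illustration only, the register set above can be modeled as a small data structure. This is a minimal Python sketch of the IAS storage layout as described on these slides, not an emulator; all names are our own.

```python
from dataclasses import dataclass

@dataclass
class IASRegisters:
    """Sketch of the IAS CPU register set (values held as plain ints)."""
    mbr: int = 0   # Memory Buffer Register: 40-bit word to/from memory or I/O
    mar: int = 0   # Memory Address Register: address for the MBR transfer
    ir: int = 0    # Instruction Register: 8-bit opcode being executed
    ibr: int = 0   # Instruction Buffer Register: right-hand 20-bit instruction
    pc: int = 0    # Program Counter: address of next instruction pair
    ac: int = 0    # Accumulator: ALU operand/result
    mq: int = 0    # Multiplier Quotient: second ALU operand/result

# Main memory as described above: 1,000 words of 40 bits each,
# with each word holding two 20-bit instructions or one number.
memory = [0] * 1000
```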

Commercial Computers

The 1950s saw the birth of the computer industry with two companies, Sperry and IBM, dominating the marketplace. The UNIVAC I (Universal Automatic Computer) was the first successful commercial computer. It was intended for both scientific and commercial applications.

• Late 1950s - UNIVAC II

• Faster

• More memory

IBM

• Punched-card processing equipment

• 1953 - the 701

• IBM’s first stored program computer

• Scientific calculations

• 1955 - the 702

• Business applications

• Led to the 700/7000 series

The Second Generation: Transistors

• The second generation saw the introduction of more complex arithmetic and logic units and control units, the use of high-level programming languages, and the provision of system software with the computer.
• System software provided the ability to load programs, move data to peripherals, and supplied libraries to perform common computations, similar to what modern OSes like Windows and Linux do.

• Literally - “small electronics”

• A computer is made up of gates, memory cells and interconnections

• These can be manufactured on a semiconductor

• e.g. silicon wafer


Transistors

• Replaced vacuum tubes

• Smaller

• Cheaper

• Less heat dissipation

• Solid State device

• Made from Silicon (Sand)

• Invented 1947 at Bell Labs

• William Shockley et al.

Transistor Based Computers

• Second generation machines

• NCR & RCA produced small machines

• IBM 7000

• DEC – 1957 "Digital Equipment Corporation"

• Produced PDP-1


Third Generation: Integrated Circuits

Microelectronics

A single, self-contained transistor is called a discrete component. Throughout the 1950s and early 1960s, electronic equipment was composed largely of discrete components: transistors, resistors, capacitors, and so on.

Microelectronics means, literally, “small electronics.” Since the beginnings of digital electronics and the computer industry, there has been a persistent and consistent trend toward the reduction in size of digital electronic circuits.


[Figure: Relationship among Wafer, Chip, and Gate]

Moore's Law

• Increased density of components on chip

• Gordon Moore – co-founder of Intel

• Number of transistors on a chip will double every year

• Since the 1970s, development has slowed a little

• Number of transistors doubles every 18 months

• Cost of a chip has remained almost unchanged

• Higher packing density means shorter electrical paths, giving higher performance

• Smaller size gives increased flexibility

• Reduced power and cooling requirements

• Fewer interconnections increase reliability

Consequences of Moore's Law

1. The cost of a chip has remained virtually unchanged during this period of rapid growth in density. This means that the cost of computer logic and memory circuitry has fallen at a dramatic rate.
2. Because logic and memory elements are placed closer together on more densely packed chips, the electrical path length is shortened, increasing operating speed.
3. The computer becomes smaller, making it more convenient to place in a variety of environments.
4. There is a reduction in power and cooling requirements.
5. The interconnections on the integrated circuit are much more reliable than solder connections. With more circuitry on each chip, there are fewer interchip connections.

Later Generations

• Beyond the third generation there is less general agreement on defining generations of computers. Table 2.2 suggests that there have been a number of later generations, based on advances in integrated circuit technology.

• With the introduction of large-scale integration (LSI), more than 1000 components can be placed on a single integrated circuit chip.

• Very-large-scale integration (VLSI) achieved more than 10,000 components per chip, while current ultra-large-scale integration (ULSI) chips can contain more than one billion components.

• SEMICONDUCTOR MEMORY: The first application of integrated circuit technology to computers was construction of the processor (the control unit and the arithmetic and logic unit) out of integrated circuit chips. But it was also found that this same technology could be used to construct memories.

• Since 1970, semiconductor memory has been through 13 generations: 1K, 4K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M, 1G, 4G, and, as of this writing, 16 Gbits on a single chip (1K = 2^10, 1M = 2^20, 1G = 2^30). Each generation has provided four times the storage density of the previous generation, accompanied by declining cost per bit and declining access time.

MICROPROCESSORS

Just as the density of elements on memory chips has continued to rise, so has the density of elements on processor chips. As time went on, more and more elements were placed on each chip, so that fewer and fewer chips were needed to construct a single computer processor.

Generations of Computer

• Vacuum tube - 1946-1957

• Transistor - 1958-1964

• Small scale integration - 1965

• Up to 100 devices on a chip

• Medium scale integration - to 1971

• 100-3,000 devices on a chip

• Large scale integration - 1971-1977

• 3,000 - 100,000 devices on a chip

• Very large scale integration - 1978 -1991

• 100,000 - 100,000,000 devices on a chip

• Ultra large scale integration – 1991 -

• Over 100,000,000 devices on a chip

Generations of Computer

x86 Evolution (1)

• 8080
  • First general-purpose microprocessor
  • 8-bit data path
  • Used in first personal computer – Altair
• 8086 – 5 MHz – 29,000 transistors
  • Much more powerful
  • 16 bit
  • Instruction cache, prefetch few instructions
  • 8088 (8-bit external bus) used in first IBM PC
• 80286
  • 16 MByte memory addressable
  • Up from 1 MByte
• 80386
  • 32 bit
  • Support for multitasking
• 80486
  • Sophisticated, powerful cache and instruction pipelining
  • Built-in maths co-processor

x86 Evolution (2)

• Pentium
  • Superscalar
  • Multiple instructions executed in parallel
• Pentium Pro
  • Increased superscalar organization
  • Aggressive branch prediction
  • Data flow analysis
• Pentium II
  • MMX technology
  • Graphics, video & audio processing
• Pentium III
  • Additional floating-point instructions for 3D graphics

x86 Evolution (3)

• Pentium 4
  • Note Arabic rather than Roman numerals
  • Further floating-point and multimedia enhancements
• Core
  • First x86 with dual core
• Core 2
  • 64-bit architecture
• Core 2 Quad – 3 GHz – 820 million transistors
  • Four processors on chip

• x86 architecture dominant outside embedded systems
• Organization and technology changed dramatically
• Instruction set architecture evolved with backwards compatibility
  • ~1 instruction per month added
  • 500 instructions available
• See Intel web pages for detailed information on processors

Embedded Systems and the ARM

• An embedded system is a combination of computer hardware and software, and perhaps additional mechanical or other parts, designed to perform a dedicated function. In many cases, embedded systems are part of a larger system or product, as in the case of an antilock braking system in a car.

• ARM evolved from RISC design

• Used mainly in embedded systems

• Used within product

• Not general purpose computer

• Dedicated function


Embedded Systems Requirements

• Different sizes

• Different constraints, optimization, reuse

• Different requirements

• Safety, reliability, real-time, flexibility, legislation

• Lifespan

• Environmental conditions

• Static vs. dynamic loads

• Slow to fast speeds

• Computation vs. I/O intensive

• Discrete event vs. continuous dynamics

Possible Organization of an Embedded System

ARM Evolution

• Designed by ARM Ltd., Cambridge, England

• Licensed to manufacturers

• High speed, small die, low power consumption

• PDAs, hand held games, phones

• E.g. iPod, iPhone

• Acorn produced ARM1 & ARM2 in 1985 and ARM3 in 1989

• Acorn, VLSI and Apple Computer founded ARM Ltd.


ARM Systems Categories

• Embedded real time

• Application platform

• Linux, Palm OS, Symbian OS, Windows Mobile

• Secure applications

What do we measure?

Define performance…

When trying to choose among different computers, performance is an important attribute. Accurately measuring and comparing different computers is critical to purchasers and therefore to designers.

Airplane            Passengers   Range (mi)   Speed (mph)
Boeing 737-100             101          630           598
Boeing 747                 470         4150           610
BAC/Sud Concorde           132         4000          1350
Douglas DC-8-50            146         8720           544

• How much faster is the Concorde compared to the 747?

• How much bigger is the Boeing 747 than the Douglas DC-8?

• So which of these airplanes has the best performance?!
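One reasonable reading of these questions (a worked sketch using the table values):

Concorde vs. 747 (speed): 1350 / 610 ≈ 2.2, so the Concorde is about 2.2 times faster.
747 vs. DC-8-50 (passengers): 470 / 146 ≈ 3.2, so the 747 carries about 3.2 times as many passengers.

Which airplane is "best" depends on the metric chosen: the DC-8-50 wins on range, the Concorde on speed, and the 747 on capacity (and on passenger throughput, passengers × mph).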

Defining Performance

We can define computer performance in several different ways.

• If you were running a program on two different desktop computers, you’d say that the faster one is the desktop computer that gets the job done first.

• If you were running a datacenter that had several servers running jobs submitted by many users, you’d say that the faster computer was the one that completed the most jobs during a day.

Computer Architecture

Defining Performance: TIME, TIME, TIME!!!

• Response time (elapsed time, latency): individual user concerns…
  • How long does it take for my job to run?
  • How long does it take to execute (start to finish) my job?
  • How long must I wait for the database query?
• Throughput: systems manager concerns…
  • How many jobs can the machine run at once?
  • What is the average execution rate?
  • How much work is getting done?

• If we upgrade a machine with a new processor, what do we increase?
• If we add a new machine to the lab, what do we increase?

Defining Performance

If we upgrade a machine with a new processor, what do we increase? Both response time and throughput are improved.

If we add a new machine to the lab, what do we increase? In this case, no single task gets its work done faster, so only throughput increases.

Thus, in many real computer systems, changing either execution time or throughput often affects the other.

Execution Time

• Elapsed Time

• counts everything (disk and memory accesses, waiting for I/O, running other programs, etc.) from start to finish

• a useful number, but often not good for comparison purposes

elapsed time = CPU time + wait time (I/O, other programs, etc.)

• CPU time

• doesn't count waiting for I/O or time spent running other programs

• can be divided into user CPU time and system CPU time (OS calls)

CPU time = user CPU time + system CPU time
⇒ elapsed time = user CPU time + system CPU time + wait time

• Our focus: user CPU time (CPU execution time or, simply, execution time)

• time spent executing the lines of code that are in our program
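To make the distinction concrete, here is a minimal Python sketch (standard library only) that measures elapsed time and CPU time for the same piece of work; workload() is a stand-in job, not part of the original slides.

```python
import os
import time

def workload():
    # Stand-in job: some CPU work plus a simulated I/O wait
    total = sum(i * i for i in range(2_000_000))
    time.sleep(0.5)  # waiting does not consume CPU time
    return total

start_wall = time.perf_counter()   # wall-clock (elapsed) timer
start_cpu = time.process_time()    # user + system CPU timer for this process

workload()

elapsed = time.perf_counter() - start_wall
cpu = time.process_time() - start_cpu

print(f"elapsed time: {elapsed:.2f} s")   # includes the sleep
print(f"CPU time:     {cpu:.2f} s")       # excludes the sleep
print(f"wait time:    {elapsed - cpu:.2f} s")

# os.times() further splits CPU time into user and system components
t = os.times()
print(f"user CPU: {t.user:.2f} s, system CPU: {t.system:.2f} s")
```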

Execution Time

• For some program running on machine X:

PerformanceX = 1 / Execution timeX

• X is n times faster than Y means:

PerformanceX / PerformanceY = n

• execution time on Y is n times longer than it is on X:

Execution timeY / Execution timeX = n
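A minimal sketch of these definitions in Python; the example times are the ones used in the worked example that follows.

```python
def performance(execution_time_s: float) -> float:
    """Performance is defined as the reciprocal of execution time."""
    return 1.0 / execution_time_s

def times_faster(time_y: float, time_x: float) -> float:
    """n such that X is n times faster than Y: Execution time_Y / Execution time_X."""
    return time_y / time_x

time_a, time_b = 10.0, 15.0                       # seconds, same program on A and B
print(performance(time_a) / performance(time_b))  # 1.5
print(times_faster(time_b, time_a))               # 1.5 -> A is 1.5 times faster than B
```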

Relative Performance

If computer A runs a program in 10 seconds and computer B runs the same program in 15 seconds, how much faster is A than B?

We know that A is n times faster than B if

Execution timeB / Execution timeA = n

Thus the performance ratio is 15 / 10 = 1.5, and A is therefore 1.5 times faster than B.

Clock Cycles

• Instead of reporting execution time in seconds, we often use cycles. In modern computers hardware events progress cycle by cycle: in other words, each event, e.g., multiplication, addition, etc., is a sequence of cycles.

seconds/program = (cycles/program) × (seconds/cycle)

• Clock ticks indicate the start and end of cycles.
• Clock period: the length of each clock cycle.
• Cycle time = time between ticks = seconds per cycle.
• Clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec, 1 MHz = 10^6 cycles/sec).
• Example: a 200 MHz clock has a cycle time of 1 / (200 × 10^6) = 5 × 10^-9 s = 5 nanoseconds.
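A small sketch of the rate/period conversion; this is plain arithmetic with no assumptions beyond the formulas above.

```python
def cycle_time_s(clock_rate_hz: float) -> float:
    """Clock period (seconds per cycle) is the inverse of clock rate (cycles per second)."""
    return 1.0 / clock_rate_hz

print(cycle_time_s(200e6))        # 200 MHz -> 5e-09 seconds
print(cycle_time_s(200e6) * 1e9)  # -> 5.0 nanoseconds
```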

Measuring Performance

Time is the measure of computer performance: the computer that performs the same amount of work in the least time is the fastest.

Program execution time is measured in seconds per program. The time can be defined in different ways, depending on what we count. Wall clock time, response time, and elapsed time all mean the total time to complete a task, including:
- disk accesses,
- memory accesses,
- input/output (I/O) activities,
- operating system overhead: everything.

Measuring Performance

When several programs share the system, it may try to optimize throughput rather than attempt to minimize the elapsed time for one program.

CPU execution time, or simply CPU time, is the actual time the CPU spends on a specific task and does not include time spent waiting for I/O or running other programs. User CPU time is the CPU time spent in the program itself. System CPU time is the CPU time spent in the operating system performing tasks on behalf of the program.

Measuring Performance

Because it is often hard to assign responsibility for operating system activities to one user program rather than another, and because of functionality differences among operating systems, differentiating between system and user CPU time is difficult to do accurately.

For consistency, we maintain a distinction between performance based on elapsed time and that based on CPU execution time.

We will use the term system performance to refer to elapsed time on an unloaded system, and CPU performance to refer to user CPU time.

Understanding Program Performance

Different applications are sensitive to different aspects of the performance of a computer system.

Many applications, especially those running on servers, depend as much on I/O performance as on CPU performance; I/O performance, in turn, relies on both hardware and software.

Total elapsed time measured by a wall clock is the measurement of interest.

In some application environments, the user may care about throughput, response time, or a complex combination of the two.

Measuring Performance

To improve the performance of a program, one must have a clear definition of the performance metric that matters.

Almost all computers are constructed using a clock that determines when events take place in the hardware. These discrete time intervals are called clock cycles.

• Clock cycle: the time for one clock period, usually of the processor clock, which runs at a constant rate.
• Clock period: the length of each clock cycle.
• Clock rate: the number of clock cycles per second; the inverse of the clock period.

Performance Equation I

seconds/program = (cycles/program) × (seconds/cycle)

equivalently

CPU execution time for a program = CPU clock cycles for a program × Clock cycle time

• So, to improve performance one can either:
  • reduce the number of cycles for a program, or
  • reduce the clock cycle time, or, equivalently,
  • increase the clock rate
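A minimal sketch of Performance Equation I in Python; the cycle count and clock rate here are made up for illustration.

```python
def cpu_time_s(clock_cycles: float, clock_cycle_time_s: float) -> float:
    """CPU execution time = CPU clock cycles for the program x clock cycle time."""
    return clock_cycles * clock_cycle_time_s

cycles = 20e9                       # hypothetical: 20 billion cycles for the program
period = 1 / 2e9                    # hypothetical 2 GHz clock -> 0.5 ns cycle time
print(cpu_time_s(cycles, period))   # -> 10.0 seconds
```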

CPU Performance and Its Factors

Alternatively, because clock rate and clock cycle time are inverses,

CPU execution time for a program = CPU clock cycles for a program / Clock rate

How many cycles are required for a program?

• Could assume that # of cycles = # of instructions

[Timeline figure: the 1st, 2nd, 3rd, … instructions shown as equal-length intervals along a time axis]

• This assumption is incorrect, because different instructions take different amounts of time (cycles).
• Why…?

How many cycles are required for a program?


• Multiplication takes more time than addition

• Floating point operations take longer than integer ones

• Accessing memory takes more time than accessing registers

• Important point: changing the cycle time often changes the number of cycles required for various instructions, because it means changing the hardware design. More later…

Example

• Our favorite program runs in 10 seconds on computer A, which has a 2 GHz clock.

• We are trying to help a computer designer build a new machine B, that will run this program in 6 seconds. The designer can use new (or perhaps more expensive) technology to substantially increase the clock rate, but has informed us that this increase will affect the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the same program.

• What clock rate should we tell the designer to target?
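A worked sketch of the answer, following the performance equation above and the numbers given:

CPU clock cycles_A = CPU time_A × clock rate_A = 10 s × 2 × 10^9 cycles/s = 20 × 10^9 cycles
CPU clock cycles_B = 1.2 × 20 × 10^9 = 24 × 10^9 cycles
Clock rate_B = CPU clock cycles_B / CPU time_B = 24 × 10^9 / 6 s = 4 GHz

So machine B must run at 4 GHz, twice the clock rate of A, to finish the program in 6 seconds.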


CPU Performance and Its Factors

Terminology

• A given program will require:
  • some number of instructions (machine instructions)
  • some number of cycles
  • some number of seconds

• We have a vocabulary that relates these quantities:

• cycle time (seconds per cycle)

• clock rate (cycles per second)

• (average) CPI (cycles per instruction)

• a floating point intensive application might have a higher average CPI

• MIPS (millions of instructions per second)

• this would be higher for a program using simple instructions

Performance Measure

• Performance is determined by execution time

• Do any of these other variables equal performance?
  • # of cycles to execute program?
  • # of instructions in program?
  • # of cycles per second?
  • average # of cycles per instruction?
  • average # of instructions per second?

• Common pitfall: thinking one of the variables is indicative of performance when it really isn't

Instruction Performance

Therefore, the number of clock cycles required for a program can be written as

CPU clock cycles = Instructions for a program × Average clock cycles per instruction

The term clock cycles per instruction, which is the average number of clock cycles each instruction takes to execute, is often abbreviated as CPI.

Clock cycles per instruction (CPI): the average number of clock cycles per instruction for a program or program fragment.

Instruction Performance

Suppose we have two implementations of the same instruction set architecture.

Computer A has a clock cycle time of 250 ps and a CPI of 2.0 for some program, and computer B has a clock cycle time of 500 ps and a CPI of 1.2 for the same program.

Which computer is faster for this program, and by how much?

Instruction Performance

We know that each computer executes the same number of instructions for the program; let's call this number I. First, find the number of processor clock cycles for each computer:
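Filling in the calculation (a worked sketch using the figures given above):

CPU clock cycles_A = I × 2.0, so CPU time_A = I × 2.0 × 250 ps = 500 × I ps
CPU clock cycles_B = I × 1.2, so CPU time_B = I × 1.2 × 500 ps = 600 × I ps

Performance_A / Performance_B = CPU time_B / CPU time_A = (600 × I) / (500 × I) = 1.2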

We can conclude that computer A is 1.2 times as fast as computer B for this program.

Performance Equation II

We can now write Performance Equation II in terms of instruction count, CPI, and clock cycle time:

CPU execution time for a program = Instruction count for a program × average CPI × clock cycle time

or, since the clock rate is the inverse of clock cycle time:

CPU execution time for a program = (Instruction count for a program × CPI) / Clock rate
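A minimal sketch of Performance Equation II in Python; the inputs are illustrative, not taken from any example on these slides.

```python
def cpu_time_s(instruction_count: float, avg_cpi: float, clock_rate_hz: float) -> float:
    """CPU time = instruction count x average CPI / clock rate."""
    return instruction_count * avg_cpi / clock_rate_hz

# Hypothetical inputs: 10 billion instructions, average CPI of 2.0, 2 GHz clock
print(cpu_time_s(10e9, 2.0, 2e9))  # -> 10.0 seconds
```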

The Classic CPU Performance Equation

A compiler designer is trying to decide between two code sequences for a particular computer. The hardware designers have supplied the following facts:
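CPI for each instruction class (as used in the calculation below):

Instruction class       A    B    C
CPI for this class      1    2    3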

For a particular high-level language statement, the compiler writer is considering two code sequences that require the following instruction counts:
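Instruction counts for each code sequence (as used in the calculation below):

Code sequence     Class A    Class B    Class C
Sequence 1           2          1          2
Sequence 2           4          1          1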


The Classic CPU Performance Equation

Which code sequence executes the most instructions?

Which will be faster?

What is the CPI for each sequence?

The Classic CPU Performance Equation ANSWER

Sequence 1 executes 2 + 1 + 2 = 5 instructions.

Sequence 2 executes 4 + 1 + 1 = 6 instructions.

Therefore, sequence 1 executes fewer instructions.

We can use the equation for CPU clock cycles based on instruction count and CPI to find the total number of clock cycles for each sequence:

The Classic CPU Performance Equation

This yields:

CPU clock cycles_1 = (2 × 1) + (1 × 2) + (2 × 3) = 2 + 2 + 6 = 10 cycles

CPU clock cycles_2 = (4 × 1) + (1 × 2) + (1 × 3) = 4 + 2 + 3 = 9 cycles

So code sequence 2 is faster, even though it executes one extra instruction. Since code sequence 2 takes fewer overall clock cycles but has more instructions, it must have a lower CPI.


The Classic CPU Performance Equation

The CPI values can be computed by dividing clock cycles by instruction count:
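A worked sketch from the counts above:

CPI = CPU clock cycles / Instruction count

CPI_1 = 10 / 5 = 2.0
CPI_2 = 9 / 6 = 1.5

Code sequence 2 therefore has the lower CPI, as expected.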

The Classic CPU Performance Equation

The following figure shows the basic measurements at different levels in the computer and what is being measured in each case.

We can see how these factors are combined to yield execution time measured in seconds per program:

seconds/program = (instructions/program) × (cycles/instruction) × (seconds/cycle)

The Classic CPU Performance Equation

The following table summarizes how the algorithm, the language, the compiler, the architecture, and the actual hardware affect the factors in the CPU performance equation.

Finally

I wish you good luck.