Organization and Architecture
Chapter 2 Computer Evolution and Performance

Home PC - 1954

Internet Hoax

• This photo was entered in a photo doctoring contest (before Adobe Photoshop was created)
• The "computer" is actually part of a submarine control console

Ken Olsen’s Famous Remark

• Ken Olsen, President of Digital Equipment Corp, 1977: “there is no reason for any individual to have a computer in his home.”
• Remark was taken out of context – he was referring to the computer of science fiction: "...the fully computerized home that automatically turned lights on and off and that prepared meals and controlled daily diets ..."

The Turing Bombe

Courtesy kentishman at http://www.flickr.com/photos/kentish/502414125/sizes/z/in/photostream/

ENIAC - background

• Electronic Numerical Integrator And Computer
• Eckert and Mauchly, University of Pennsylvania
• Designed to calculate trajectory tables for weapons
• Started 1943
• Finished 1946
  — Too late for war effort
• Used until 1955

ENIAC Programming in the ’40s

ENIAC - details

• Decimal (not binary)
• 20 accumulators of 10 digits
  — Each accumulator had 100 vacuum tubes
  — One decimal digit was represented by a ring of 10 tubes; only one “on” at a time
• Programmed manually by switches
• 18,000 vacuum tubes
• 30 tons
• 15,000 square feet
• 140 kW power consumption
• 5,000 additions per second

von Neumann/Turing

• Stored Program concept
• Features
  — Data and instructions (programs) are stored in a single read/write memory
  — Memory contents are addressable by location, regardless of the content itself
  — Sequential execution
  — ALU operating on binary data
  — Control unit interpreting instructions from memory and executing them
  — Input and output equipment operated by control unit
• Princeton Institute for Advanced Studies (IAS)
• Completed 1952

Paraphrased from von Neumann’s paper:

• 1. Since the device is primarily a computer it will have to perform the elementary operations of arithmetic most frequently. Therefore, it should contain specialized organs for just these operations, i.e., addition, subtraction, multiplication, and division.
• 2. The logical control of the device (i.e., proper sequencing of its operations) can best be carried out by a central control organ.
• 3. A device which is to carry out long and complicated sequences of operation must have a considerable memory capacity.

Von Neumann (cont’d)

• 4. The device must have organs to transfer information from the outside recording medium of the device into the central arithmetic part and central control part, and the memory. These organs form its input.
• 5. The device must have organs to transfer information from the central arithmetic part and central control part, and the memory into the outside recording medium. These organs form its output.

IAS Machine

• This architecture came to be known as the “von Neumann” architecture and has been the basis for virtually every machine designed since then

Structure of von Neumann (IAS) machine

Structure of IAS – detail

IAS - details

• 1000 x 40 bit words
  — Binary numbers
  — 2 x 20 bit instructions
• Set of registers (storage in CPU)
  — Memory Buffer Register
  — Memory Address Register
  — Instruction Register
  — Instruction Buffer Register
  — Program Counter
  — Accumulator
  — Multiplier Quotient

IAS Instruction Set

IAS Operation

Commercial Computers

• 1947 - Eckert-Mauchly Computer Corp
• UNIVAC I (Universal Automatic Computer)
  — Commissioned by US Bureau of Census for 1950 calculations
  — First successful commercial computer
• Became part of Sperry-Rand Corporation
• Late 1950s - UNIVAC II
  — Faster and with more memory than Univac I
  — Backwards compatible
• Developed 1100 series for scientific computation
• Scientific/text processing split

IBM

• The major manufacturer of punched-card processing equipment
• 1953 - the 701
  — IBM’s first stored program computer
  — Scientific calculations
• 1955 - the 702
  — Business applications
• Led to 700/7000 series
• Established IBM dominance

Transistors: 2nd generation

• Replaced vacuum tubes
• Smaller, cheaper, less heat dissipation
• Solid State device, made from silicon (sand)
• Invented 1947 at Bell Labs by William Shockley et al.
• Used in computers in the late 1950’s
• Defined the “second generation” of computers

Transistor Based Computers

• Second generation machines
• NCR & RCA produced small transistor machines
• IBM 7000
• DEC - 1957
  — Produced PDP-1
  — First in series of PDP machines that marked a radical shift in philosophy of computer design: focus on interaction with user rather than efficient use of computer cycles

Hacker culture

• “Hacker” used to mean someone who loved to play with computers
• DEC PDP computers led to the development of hacker cultures at MIT, BBN and elsewhere
• Developed music programs and the first computer game Spacewar!
• The hacker culture lives on in the open source movement
• See http://www.computerhistory.org/pdp-1/index.php

The IBM 7094

• Last of the 7000 series
• Illustrates evolution of a computer family

An IBM 7094 Configuration

Integrated Circuits; 3rd Generation

• Transistors are “discrete components”
• Manufactured in own container; soldered to circuit board
• Early 2nd generation: 10,000 discrete transistors
• Later computers several hundred thousand
• 1958: Integrated circuits (ICs)
  — Many transistors on piece of silicon
  — Defines 3rd generation of computers
• Two most important 3rd generation computers were the IBM System 360 and the DEC PDP-8

How many transistors on one chip?

• Intel: see http://www.intel.com/pressroom/kits/quickreffam.htm
• 8086: 29,000
• 80386: 275,000
• Original Pentium: 3.1 million
• Pentium IV: up to 169 million
• Itanium: up to 1.72 billion

Computer Components

• Only two basic components are needed to create a computer: gates and memory cells
• Each component is composed of multiple transistors

Four basic functions

• Data Storage: memory cells
• Data Processing: gates
• Data Movement: connections (paths) between gates and memory cells
• Control: control signals activate gates and memory and determine whether memory access is read or write

Chip manufacturing

• Gates, memory cells and interconnections can be etched into a single piece of silicon

Moore’s Law

• Gordon Moore – co-founder of Intel
• Predicted that the number of transistors on a chip will double every year
• Since 1970’s development has slowed a little
  — Number of transistors doubles every 18 months (rough check in the sketch below, using the chip counts quoted above)
• Cost of a chip has remained almost unchanged
• Higher packing density means shorter electrical paths, giving higher performance
• Smaller size gives increased flexibility
• Reduced power and cooling requirements
• Fewer interconnections increase reliability

Generations of Computers

• Vacuum tube - 1946-1957
• Transistor - 1958-1964
• Small scale integration - 1965 on
  — Up to 100 devices on a chip
• Medium scale integration - to 1971
  — 100-3,000 devices on a chip
• Large scale integration - 1971-1977
  — 3,000 - 100,000 devices on a chip
• Very large scale integration - 1978-1991
  — 100,000 - 100,000,000 devices on a chip
• Ultra large scale integration – 1991 -
  — Over 100,000,000 devices on a chip
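As a rough sanity check of the doubling claim, the Python sketch below estimates the implied doubling time from the transistor counts on the “How many transistors on one chip?” slide. The introduction years are my assumptions for illustration; they are not given in these notes.

from math import log2

chips = [  # (name, assumed introduction year, count from the slide above)
    ("8086", 1978, 29_000),
    ("80386", 1985, 275_000),
    ("Pentium", 1993, 3_100_000),
    ("Itanium", 2006, 1_720_000_000),
]

for (n1, y1, c1), (n2, y2, c2) in zip(chips, chips[1:]):
    doublings = log2(c2 / c1)          # how many times the count doubled
    years = y2 - y1
    print(f"{n1} -> {n2}: {doublings:.1f} doublings in {years} years "
          f"(~{years / doublings:.1f} years per doubling)")

Under these assumed years the implied doubling period comes out between roughly 1.4 and 2.3 years, consistent with the “every 18 months to two years” form of the law.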

Growth in CPU

IBM 360 series

• Introduced in 1964
• Replaced (& not compatible with) 7000 series
• First planned “family” of computers
  — Similar or identical instruction sets
  — Similar or identical O/S
  — Increasing speed
  — Increasing number of I/O ports (i.e. more terminals)
  — Increased memory size
  — Increased cost of higher family members

The System 360 Family

System 360

• Differences were:
  — Complexity of ALU circuitry allowing parallel execution
  — Increased width allowing faster data transfer
• The 360 family gave IBM a huge market share
• Architecture still used for IBM mainframes

DEC PDP-8

• 1964
• First minicomputer (named after the miniskirt!)
• Did not need air conditioned room
• Small enough to sit on a lab bench
• $16,000
  — vs. $100k+ for IBM 360
• Embedded applications & OEM
• Introduced the BUS structure (vs. the central-switched IBM architecture)

DEC - PDP-8 Bus Structure

• 96 separate signal paths carry address, data and control signals
• Easy to plug new modules into bus
• Bus is controlled by CPU
• Now virtually universal in microcomputers

Beyond 3rd Generation

• Classification by generation became less meaningful after the early 1970’s due to rapid pace of technological development
• Two main developments are still being worked out:
  — Semiconductor memory
  — Microprocessors

Magnetic Core Memory

• Each bit stored in a small iron ring (about 1/16” diameter)
• Mostly assembled by hand
• Reads are destructive
• See http://www.columbia.edu/acis/history/core.html

Semiconductor Memory

• First developed in 1970 by Fairchild
• Size of a single core
  — i.e. 1 bit of magnetic core storage
• Held 256 bits
• Non-destructive read
• Much faster than core
• 1974: became cheaper than core memory
• Capacity approximately doubles each year
• 2GB chips now available

Intel Microprocessors

• 1971 - 4004
  — First microprocessor
  — All CPU components on a single chip
  — 4 bit
• Followed in 1972 by 8008
  — 8 bit
  — Both designed for specific applications
• 1974 - 8080
  — Intel’s first general purpose microprocessor

16 and 32 bit processors

• 16-bit processors were introduced in the late 1970’s
• First 32-bit processors were introduced in 1981 (Bell Labs and Hewlett-Packard)
• Intel’s 32-bit processor did not arrive until 1985 – and we’re still using that architecture

Intel Processors 1971 - 1989

Intel x86 Processors 1990 - 2002

Processing Power

• Today’s $400 personal computer has more computing power than a 1990 mainframe
• Much of this is gobbled up by bloated operating systems (e.g., Vista)
• But many PC applications are only possible with great processing capability:
  — Image processing
  — Video Editing
  — Speech recognition
  — Simulation modeling
  — …

Speeding it up

• Pipelining
• On board cache
• On board L1 & L2 cache
• Branch prediction
• Data flow analysis
• Speculative execution

Pipelining

• Like an assembly line
• Instruction execution is divided into stages
• When one stage has completed, that circuitry is free to operate on the “next” instruction (a timing sketch follows below)
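A minimal Python sketch of the assembly-line timing argument. The stage and instruction counts are illustrative assumptions: with k stages, an ideal pipeline finishes n instructions in k + (n - 1) cycles instead of n × k.

def sequential_cycles(n: int, k: int) -> int:
    return n * k          # each instruction uses all k stages before the next starts

def pipelined_cycles(n: int, k: int) -> int:
    return k + (n - 1)    # fill the pipe once, then one completion per cycle

n, k = 1000, 5            # illustrative: 1000 instructions, 5 pipeline stages
print(sequential_cycles(n, k))                            # 5000 cycles
print(pipelined_cycles(n, k))                             # 1004 cycles
print(sequential_cycles(n, k) / pipelined_cycles(n, k))   # ~4.98x, approaching k

For large n the speedup approaches the number of stages k, which is why deeper pipelines were an attractive way of speeding things up.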

Cache Memory

• Very fast memory stores copies of data in slow main memory
• Can be onboard (same chip as processor) or in separate chips

Branch Prediction

• Processor looks ahead in instruction stream and attempts to predict which branch will be taken by conditional code
• Execute predicted branch and hope it’s right
• Wrong guess leads to a stall while the pipeline is flushed and refilled (a common prediction mechanism is sketched below)
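These notes do not name a particular prediction mechanism; one common scheme is a 2-bit saturating counter, sketched below in Python as a hypothetical illustration. Two consecutive wrong guesses are needed to flip its prediction, which suits loop branches.

class TwoBitPredictor:
    """States 0-1 predict 'not taken', states 2-3 predict 'taken'."""

    def __init__(self) -> None:
        self.state = 2                      # start weakly "taken"

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Saturating: nudge toward the actual outcome, clamped to [0, 3]
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True] * 8 + [False] + [True] * 8   # a loop branch with one exit
correct = sum(p.predict() == taken or p.update(taken) for taken in outcomes)

p = TwoBitPredictor()
correct = 0
for taken in outcomes:
    correct += (p.predict() == taken)
    p.update(taken)
print(f"{correct}/{len(outcomes)} correct")     # 16/17: only the loop exit mispredicts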

Data Flow Analysis

• Processor examines code to determine which instructions are dependent on results of instructions in pipeline
• Reorders instructions in code so that independent instructions can be executed in parallel pipelines

Speculative Execution

• With branch prediction and data flow analysis processor can execute some instructions before they appear in the execution stream
• Results stored in temporary locations

Performance Balance

• Processor speed has increased
• Memory capacity has increased
• BUT: Memory speed lags behind processor speed

Logic and Memory Performance Gap

Solutions

• Increase number of bits retrieved at one time
  — Make DRAM “wider” rather than “deeper”
• Change DRAM interface
  — Cache
• Reduce frequency of memory access
  — More complex cache and cache on chip
• Increase interconnection bandwidth
  — High speed buses
  — Hierarchy of buses

I/O Devices

• Peripherals with intensive I/O demands
• Large data throughput demands
• Processors can handle this
• But there are problems moving data between processor and peripheral
• Solutions:
  — Caching
  — Buffering
  — Higher-speed interconnection buses
  — More elaborate bus structures
  — Multiple-processor configurations

Typical I/O Device Data Rates

Key is Balance

• Processor components
• Main memory
• I/O devices
• Interconnection structures

Improvements in Chip Organization and Architecture

• Increase hardware speed of processor
  — Fundamentally due to shrinking size
    – More gates, packed more tightly, increasing clock rate
    – Propagation time for signals reduced
• Increase size and speed of caches
  — Dedicating part of processor chip
    – Cache access times drop significantly
• Change processor organization and architecture
  — Increase effective speed of execution
  — Parallelism

Problems with Clock Speed and Logic Density

• Power
  — Power density increases with density of logic and clock speed
  — Dissipating heat
• RC delay
  — Speed at which electrons flow limited by resistance and capacitance of metal wires connecting them
  — Delay increases as RC product increases
  — Wire interconnects thinner, increasing resistance
  — Wires closer together, increasing capacitance
• Memory latency
  — Memory speeds lag processor speeds
• Solution:
  — More emphasis on organizational and architectural approaches

Intel Microprocessor Performance

Increased Cache Capacity

• Typically two or three levels of cache between processor and main memory
• Chip density increased
  — More cache memory on chip
    – Faster cache access
• Pentium chip devoted about 10% of chip area to cache
• Pentium 4 devotes about 50%

More Complex Execution Logic

• Enable parallel execution of instructions
• Pipeline works like assembly line
  — Different stages of execution of different instructions at same time along pipeline
• Superscalar processing allows multiple pipelines within single processor
  — Instructions that do not depend on one another can be executed in parallel

Diminishing Returns

• Internal organization of processors complex
  — Can get a great deal of parallelism
  — Further significant increases likely to be relatively modest
• Benefits from cache are reaching limit
• Increasing clock rate runs into power dissipation problem
  — Some fundamental physical limits are being reached

New Approach – Multiple Cores

• Multiple processors on single chip
  — Large shared cache
• Within a processor, increase in performance proportional to square root of increase in complexity (worked numbers below)
• If software can use multiple processors, doubling number of processors almost doubles performance
• So, use two simpler processors on the chip rather than one more complex processor
• With two processors, larger caches are justified
  — Power consumption of memory logic less than processing logic
• Example: IBM POWER4
  — Two cores based on PowerPC

POWER4 Chip Organization
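Putting the square-root rule from the multiple-cores slide into numbers, in Python. The transistor budget is an illustrative assumption (twice a baseline core); the rule itself is the one stated above.

from math import sqrt

budget = 2.0                            # transistor budget relative to a baseline core

one_big_core = sqrt(budget)             # performance ~ sqrt(complexity): ~1.41x
two_small_cores = 2 * sqrt(budget / 2)  # two baseline cores: ~2.0x if both stay busy

print(f"one complex core: {one_big_core:.2f}x")
print(f"two simple cores: {two_small_cores:.2f}x (ideal parallel workload)")

The comparison only favors the two simpler cores when software can actually use both; otherwise the single complex core’s 1.41x wins, which is why the slide conditions the claim on parallel-capable software.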

Pentium Evolution (1)

• 8080
  — first general purpose microprocessor
  — 8 bit data bus
  — Used in first personal computer – Altair
• 8086
  — much more powerful
  — 16 bit processor and bus
  — instruction cache, prefetch few instructions
  — 8088 (8 bit external bus) used in first IBM PC
• 80286
  — 16 Mbyte memory addressable
  — up from 1Mb
• 80386
  — 32 bit
  — Support for multitasking

Pentium Evolution (2)

• 80486
  — sophisticated powerful cache and instruction pipelining
  — built in floating point co-processor
• Pentium
  — Superscalar
  — Multiple instructions executed in parallel
• Pentium Pro
  — Increased superscalar organization
  — Aggressive:
    – branch prediction
    – data flow analysis
    – speculative execution

Pentium Evolution (3)

• Pentium II
  — MMX technology
  — graphics, video & audio processing
• Pentium III
  — Additional floating point instructions for 3D graphics
• Pentium 4
  — Note Arabic rather than Roman numerals
  — Further floating point and multimedia enhancements
• Itanium
  — 64 bit
  — see chapter 15
• Itanium 2
  — Hardware enhancements to increase speed
• See Intel web pages for detailed information on processors

Embedded Systems

• Refers to the use of electronics and software within a product or device, designed to perform a specific function
• Examples include automotive systems (ignition, engine control, braking); consumer electronics and household appliances; industrial control; medical devices; office automation
• There are far more embedded systems in existence than general purpose computers

Embedded Systems

Constraints on Embedded Systems

• Requirements and constraints vary
  — Small systems may have radically different cost constraints (pennies) from large systems (millions of dollars)
  — Stringent to non-existent quality requirements (safety, reliability, legislated requirements)
  — Short (single use) to long (decades) lifetimes
  — Resistance to environmental conditions such as moisture, pressure, temperature, humidity, radiation

Constraints on Embedded Systems

  — Different application characteristics depending on static or dynamic loads, sensor speed, response time requirements, computation versus I/O intensive tasks
  — Different models of computation ranging from discrete asynchronous event systems to systems with continuous time dynamics

Environmental Coupling

• Embedded systems are typically tightly coupled with their environment
• Most systems are subject to real-time response constraints generated by interaction with the environment
  — speed of motion, measurement precision, reactive latency dictate software requirements

Possible Organization

ARM Microprocessors

• ARM is a family of RISC microprocessors designed by ARM Inc
  — the company designs architectures and licenses them to manufacturers
• ARM processors are widely used in PDAs, games, phones and consumer electronics, including iPod and iPhone

ARM Evolution

• Early 1980s: Acorn Inc contracted with BBC to develop a new microcomputer architecture
• Acorn RISC Machine (ARM1) released 1985
  — VLSI Technology was the silicon partner
  — High performance with low power consumption
  — ARM2 added multiply and swap instructions
  — ARM3 added a small cache
• Acorn, VLSI and Apple started ARM Ltd: Advanced RISC Machine: ARM6

ARM Evolution

Family  | Notable Features                                          | Cache                      | Typical MIPS @ MHz
ARM1    | 32-bit RISC                                               | None                       |
ARM2    | Multiply and swap instructions; integrated memory
          management unit, graphics and I/O processor              | None                       | 7 MIPS @ 12 MHz
ARM3    | First use of processor cache                              | 4 KB unified               | 12 MIPS @ 25 MHz
ARM6    | First to support 32-bit addresses; floating-point unit    | 4 KB unified               | 28 MIPS @ 33 MHz
ARM7    | Integrated SoC                                            | 8 KB unified               | 60 MIPS @ 60 MHz
ARM8    | 5-stage pipeline; static branch prediction                | 8 KB unified               | 84 MIPS @ 72 MHz
ARM9    |                                                           | 16 KB/16 KB                | 300 MIPS @ 300 MHz
ARM9E   | Enhanced DSP instructions                                 | 16 KB/16 KB                | 220 MIPS @ 200 MHz
ARM10E  | 6-stage pipeline                                          | 32 KB/32 KB                |
ARM11   | 9-stage pipeline                                          | Variable                   | 740 MIPS @ 665 MHz
Cortex  | 13-stage superscalar pipeline                             | Variable                   | 2000 MIPS @ 1 GHz
XScale  | Applications processor; 7-stage pipeline                  | 32 KB/32 KB L1, 512 KB L2  | 1000 MIPS @ 1.25 GHz

DSP = Digital Signal Processing

Processor Assessment

• What criteria do we use to assess processors?
  — Cost
  — Size
  — Reliability
  — Security
  — Backwards compatibility
  — Power consumption
  — Performance

Performance Assessment

• Performance is not easy to compare; it varies with
  — Processor Speed
  — Instruction Set
  — Application
  — Choice of programming language
  — Quality of compiler
  — Skill of the programmer

Processor Speed

• To measure processor speed, we can consider
  — Clock speed
  — Millions of instructions per second (MIPS)
  — Millions of floating point instructions per second (MFLOPS)

Clock speed

• The system clock provides a series of 1/0 pulses (square wave) that drives circuitry in the processor
• Changes in state of circuitry are synchronized with the signal changes (1 to 0 or 0 to 1)
• A quartz crystal is used to generate a sine wave
• An analog to digital (A/D) converter turns this into a square wave
• One pulse is a clock cycle or clock tick
  — A 1GHz clock produces one billion ticks per second
  — But a machine with a 2GHz clock is not necessarily twice as fast

Clock

Instruction Execution

• Machine instructions are operations such as load from memory, add, store to memory
• Instruction execution is a set of discrete steps, such as fetch, decode, fetch operands, execute, write operands
  — Some instructions may take only a few clock cycles while others require dozens
  — When pipelining is used several instructions may be executed in parallel
• So we cannot simply compare clock speeds

Instruction Execution Rate

• If all instructions required the same number of clock ticks then the average execution rate would be a constant dependent on clock speed
• In addition to variation in instruction clocks, we also have to take into account the time needed to read and write memory
  — This can be significantly slower than the system clock

Performance Factors and System Attributes

Total time T = Ic × [p + (m × k)] × τ

Where
  Ic = instruction count
  p  = processor cycles needed to decode / execute
  m  = number of memory references needed
  k  = ratio between memory cycle time and processor cycle time
  τ  = processor cycle time (1 / clock rate)
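A worked evaluation of the formula in Python, with made-up values (the slides give no numbers at this point): 2 million instructions, 2 cycles to decode/execute, half a memory reference per instruction, memory four times slower than the processor cycle, and a 400 MHz clock.

Ic = 2_000_000        # instruction count (assumed)
p = 2                 # processor cycles per instruction (assumed)
m = 0.5               # memory references per instruction (assumed)
k = 4                 # memory cycle time / processor cycle time (assumed)
tau = 1 / 400e6       # processor cycle time at 400 MHz

T = Ic * (p + m * k) * tau          # 2M instructions x 4 cycles x 2.5 ns
print(f"T = {T * 1e3:.0f} ms")      # 20 ms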

Instructions per Second

• Commonly measured in millions:

  MIPS = Ic / (T × 10^6)

• Example: program executes 2 million instructions on a 400MHz processor. Program has four major types of instructions:

  Instruction Type           | CPI | Mix
  Arithmetic and logic       |  1  | 60%
  Load/Store with cache hit  |  2  | 18%
  Branch                     |  4  | 12%
  Load/Store with cache miss |  8  | 10%

  Avg Cycles/Instruction = (1 × 0.60) + (2 × 0.18) + (4 × 0.12) + (8 × 0.10) = 0.6 + 0.36 + 0.48 + 0.8 = 2.24
  MIPS = (400 × 10^6) / (2.24 × 10^6) ≈ 178

  (The sketch below reproduces this calculation.)

MFLOPS

• For heavy-duty “number crunching” applications the number of floating point operations per second is of more interest than MIPS
• MFLOPS = Millions of Floating Point Operations / Second
• Important for scientific applications and games
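A short Python sketch reproducing the worked MIPS example above: weighted CPI from the instruction mix, then MIPS = clock rate / (CPI × 10^6).

mix = [  # (cycles per instruction, fraction of instructions) from the table above
    (1, 0.60),   # arithmetic and logic
    (2, 0.18),   # load/store with cache hit
    (4, 0.12),   # branch
    (8, 0.10),   # load/store with cache miss
]

cpi = sum(cycles * frac for cycles, frac in mix)
mips = 400e6 / (cpi * 1e6)
print(f"CPI  = {cpi:.2f}")    # 2.24
print(f"MIPS = {mips:.2f}")   # 178.57 (the slide rounds to 178)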

Instruction Set Differences

• MIPS and MFLOPS have limited utility as performance assessments
• Consider A = B + C in a high-level language (HLL)
• Might be translated as one instruction (CISC)
    add mem(B), mem(C), mem(A)
• Or three (Intel)
    mov eax, B
    add eax, C
    mov A, eax
• Or four (typical RISC)
    load R1, B
    load R2, C
    add R3, R2, R1
    store A, R3

Benchmarks

• A program designed to test timing for a particular type of application
  — Written in a high level language for portability
  — Representative of a particular programming style (number crunching, business database, systems, …)
  — Designed for easily measured timing
• Standard Performance Evaluation Corp (SPEC) has produced a widely recognized set of benchmarks
  — Measurements across a variety of benchmarks are averaged using a geometric mean
• See http://www.spec.org/benchmarks.html

A note on measures of central tendencies

• The conventional arithmetic average is only one of many measures of central tendency. Three classical or Pythagorean means are (for n observations):
  — Arithmetic mean (sum of the observations divided by n)
  — Geometric mean (nth root of the product of the observations)
  — Harmonic mean (n divided by the sum of the reciprocals of the observations)
• It is always true that arithmetic mean ≥ geometric mean ≥ harmonic mean, with equality only when all observations are equal (see the sketch below)
• Reasons for selection of a particular measure are beyond the scope of this course.

Amdahl’s Law

• Proposed by Gene Amdahl in 1967 to predict the maximum speedup of a system using multiple processors compared to a single processor
• For any given program some fraction f involves code that is amenable to parallel execution while the remainder of the code (1-f) can only be executed in a serial fashion
• In its simplest form Amdahl’s law for N processors can be expressed as

  Speedup = 1 / [(1-f) + f/N]
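The three means in Python, applied to some illustrative benchmark speedup ratios (the numbers are assumptions, not from these notes). The AM ≥ GM ≥ HM ordering shows in the output; the geometric mean is the one SPEC uses across benchmarks.

from math import prod

def arithmetic_mean(xs): return sum(xs) / len(xs)
def geometric_mean(xs):  return prod(xs) ** (1 / len(xs))
def harmonic_mean(xs):   return len(xs) / sum(1 / x for x in xs)

ratios = [1.2, 0.9, 1.5, 2.0]                 # illustrative benchmark ratios
print(f"AM = {arithmetic_mean(ratios):.3f}")  # 1.400
print(f"GM = {geometric_mean(ratios):.3f}")   # 1.342 (SPEC-style average)
print(f"HM = {harmonic_mean(ratios):.3f}")    # 1.286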

Amdahl’s Law

• This leads to the rather pessimistic chart below for 1,024 processors
• The very steep curve near the Y axis suggests that very few programs will achieve a speedup proportional to the number of processors

Gustafson’s Law

• John Gustafson argued that in practice the problem size is not independent of the number of processors and that massively parallel systems can achieve near optimal speedup with bounded problem size
• Note that both arguments can be applied to improvements other than parallelism; e.g., computation of floating point arithmetic in hardware rather than software
• The sketch below evaluates both laws at N = 1,024
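A Python sketch evaluating Amdahl’s formula at N = 1,024 alongside Gustafson’s scaled speedup in its standard formulation, S = (1-f) + f × N. These notes do not state Gustafson’s formula explicitly, so treat that part as an assumption added for comparison.

def amdahl(f: float, n: int) -> float:
    return 1 / ((1 - f) + f / n)      # fixed-size speedup, as on the slide

def gustafson(f: float, n: int) -> float:
    return (1 - f) + f * n            # scaled speedup (standard formulation)

N = 1024
for f in (0.50, 0.90, 0.99):
    print(f"f = {f:.2f}: Amdahl = {amdahl(f, N):7.1f}x, "
          f"Gustafson = {gustafson(f, N):7.1f}x")
# f = 0.50: Amdahl ~2.0x,  Gustafson ~512.5x
# f = 0.90: Amdahl ~9.9x,  Gustafson ~921.7x
# f = 0.99: Amdahl ~91.2x, Gustafson ~1013.8x

Even at f = 0.99, Amdahl’s fixed-size speedup stays near its 1/(1-f) = 100x ceiling, which is the steep, pessimistic curve the slide describes, while Gustafson’s scaled measure approaches f × N.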
