
THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

HISTORY OF EE 2310

• Initially planned by Prof. David Harper as a counterpart to courses on computer organization and design at Berkeley and Stanford
  – D. Patterson (Berkeley) and J. Hennessy (Stanford) are the academic originators of RISC architecture
• Required course, prerequisite to EE 4320
• Taught in 1994–95 as EE 2399 by lecturers
  – Implementation did not meet the expectations of EE faculty
• Taught 1996–present by Prof. Cantrell, D. Hollenbeck and Dr. Dodge
• Background material added to make the Patterson-Hennessy textbook accessible to students with zero hardware or software experience:
  – Digital kindergarten (gates, few-gate circuits, Boolean, Karnaugh)
  – Basic CS (overview of processes, data structures)
  – MANY comments on evaluation forms call for a laboratory
    ◦ EE 2V99 taught in Spring 1999

© C. D. Cantrell (08/1999)

EE2310 OVERVIEW

• Course organization
  – Professor Cantrell (email: [email protected])
    ◦ Voice phone: (972) 883-2868
    ◦ Office: EC 2.302
    ◦ Office hours: Saturdays 10 AM – noon
  – Dr. Dodge (email: [email protected])
    ◦ Voice phone: (972) 883-2951
    ◦ Office: EC 2.926
    ◦ Office hours: 4–5 PM Tuesdays and Thursdays
  – TAs: Arturo Garcia, Sabrina Zaman
    ◦ Office: EC 2.908
    ◦ Office hours: Monday through Thursday, 3:00–4:15 PM
  – Provisional grading algorithm (subject to change):
    course % grade = 0.10(homework %) + 0.225(midterm 1 %) + 0.225(midterm 2 %) + 0.45(final exam %)

© C. D. Cantrell (08/1999)

GOALS FOR WEEK 1

• Get access to the World Wide Web
  – Ways to get on the Web:
    ◦ PC or Mac via modem, cable modem or DSL
    ◦ PC or Mac in UTD Microcomputer Lab
    ◦ Internet-connected Unix machine from work or UTD
  – EE 2310 Home Page: http://www.utdallas.edu/~cantrell/ee2310/
• Acquire books:
  – Hennessy and Patterson, Computer Organization and Design, Second Edition, Morgan Kaufmann Publishers
  – John Waldron, Introduction to RISC Assembly Language Programming, Addison-Wesley
  – Capilano Computing Systems, LogicWorks 4, Addison-Wesley
  – Roger Tokheim, Digital Principles, Third Edition, McGraw-Hill

© C. D. Cantrell (08/1999)

WHAT DO EEs DO?

• DESIGN
  – Systems — Computers, wireless phones, fiberoptic networks, ...
  – Circuits — Logic circuits, transmission lines, amplifiers, ...
  – Devices — Transistors, antennas, ...
  – Software — Signal processing, navigation, simulation, ...
• ANALYZE (one must understand in order to design)
  – Systems — Performance, signal-to-noise ratio, ...
  – Devices — Semiconductors, antennas, waveguides, ...
• TEACH
  – Courses
  – Individual instruction — M.S., Ph.D. dissertation research

© C. D. Cantrell (04/1999)

COURSE GOALS

• Acquire the ability to use a hierarchical approach to understand a complex system
  – Basic Electrical Engineering approach
  – Valid for hardware and software
  – Facilitates troubleshooting
  – Essential on the job
• Become sufficiently well acquainted with the principles of computer architecture to be able to make intelligent use of computers for designing and simulating engineering systems, components, and devices
• Acquire a basic knowledge of assembly language
  – Helps one to accomplish the first two goals
  – Useful if one works with embedded systems

© C. D. Cantrell (04/1999)

VIEWS OF A 74135 DEVICE AT THREE DIFFERENT HIERARCHICAL LAYERS

[Figure: the same 74135 device drawn at the IC level (pin-out), at the gate level (inputs 1A/1B/1C and 2A/2B/2C, outputs 1Y and 2Y), and at the transistor level]

FALLACIES

• “I don’t need to understand computers in order to use them.”
  – Nothing works all the time. When it breaks, it’s up to you:
    ◦ to diagnose the problem
    ◦ (probably) to fix it
• “I’m a CS major. I don’t need to know about circuits.”
• “I’m majoring in EE. I don’t need to know about data structures.”
  – The boundary between hardware and software is fuzzy
    ◦ Field-programmable gate arrays (FPGAs): Software-programmable computer architecture!
    ◦ Computational modeling: Choice of algorithm and program design depend on computer architecture
      · Cache size/design affects loop order
      · Parallelizability of algorithms/programs affects choice & design of hardware

© C. D. Cantrell (04/1999)

WHAT IS “COMPUTER ARCHITECTURE”?

Applications
Compiler / O/S Kernel
Instruction Set Architecture
Functional Units / Memory / I/O System
Logic Gates (Digital Design)
Devices (Circuit Design)

(Layers of abstraction, from applications at the top down to devices at the bottom)

© C. D. Cantrell (04/1999)

SHRINK-WRAPPED SOFTWARE

• Shrink-wrapped software is a compiled program (or suite of programs) packaged in a box that is sealed inside a plastic envelope that shrinks to fit the box when it is heated
• Three things make shrink-wrapped software usable on computers that are manufactured by different vendors:
  – Different vendors compile their software for the same instruction set architecture
  – Different vendors compile their software for the same operating system
  – Input/output hardware that conforms to a common set of standards is available on different systems

© C. D. Cantrell (08/1999)

THE MAIN COMPONENTS OF A COMPUTER

• Computer
  – Processor
    ◦ Control
    ◦ Datapath
  – Memory
  – Peripheral Devices
    ◦ Input
    ◦ Output

© C. D. Cantrell (04/1999)

HIERARCHICAL APPROACH TO WRITING SOFTWARE

• Large program (100’s or 1000’s of lines)
  – Use top-down approach:
    ◦ Divide problem into tasks
    ◦ Define a function (procedure, subroutine) to accomplish each task
    ◦ Code each function in its own module
    ◦ Specify interfaces (parameters to be passed, etc.) for each function module
  – Essential concept for systems integration: The implementation of a function is independent of its interface specification
    ◦ Implementation belongs to a different level in the hierarchy
    ◦ Each module is a “black box” to higher-level modules
• Design is the most important step in creating a large program
  – Most hard-to-find bugs originate in a faulty design
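As an illustration of the top-down approach, here is a minimal Python sketch (the tasks and function names are invented for this example): each function is coded as its own module with a specified interface, and the top level treats each one as a black box.

```python
# Hypothetical top-down decomposition: the problem is divided into
# tasks, each implemented by a function with a specified interface.
# Callers depend only on the interface, not the implementation.

def read_samples(text):
    """Interface: take raw text, return a list of floats."""
    return [float(tok) for tok in text.split()]

def average(samples):
    """Interface: take a list of floats, return their mean."""
    return sum(samples) / len(samples)

def report(mean):
    """Interface: take a number, return a formatted string."""
    return f"mean = {mean:.2f}"

def main(text):
    # The top-level module only wires the black boxes together.
    return report(average(read_samples(text)))

print(main("1.0 2.0 3.0 4.0"))
```

Swapping in a different implementation of `average` (say, a median) would not require changing `main`, which is the point of specifying interfaces first.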

© C. D. Cantrell (04/1999)

SIMULATORS

• SPIM (“1/25th the performance at none of the cost”)
  – Simulates the RISC architecture (MIPS) most used in embedded systems (Nintendo 64, Sony PlayStation, ...)
  – Available for architectures other than the native one
  – The MIPS instruction set is simpler than most
  – The SPIM interface is better than many real debuggers
    ◦ Registers, data segment, text segment, stack
  – Documentation: Patterson & Hennessy, Appendix A; Waldron
• XMPSIM
  – Simulates one Cray X-MP
  – Runs under DOS
  – Gives good view of pipeline timing, stalls, etc.

© C. D. Cantrell (01/1999)

TOPICAL OUTLINE

• What’s inside the case?
• History — know it or repeat it
• Performance analysis
• Data representations
• Combinational logic circuits and Karnaugh maps
• MIPS and DEC Alpha assembly languages
• Sequential logic circuits
• Design of a simple arithmetic and logical unit (ALU)
• Design of the ALU control
• Designing a processor for speed: Pipelining
• Design of high-performance memory systems
• Input and output systems

© C. D. Cantrell (04/1999)

HOW TO GET ONTO THE WEB FROM HOME (1)

• Basic requirements
  – PC (Windows 3.1, 95, 98 or NT, or Linux) or Mac (System 7/8)
  – Modem (28.8 or 56k preferred)
  – Text-based communications program (not needed in Windows 95)
• Step 1: Find an Internet Services Provider (ISP) (UTD, for example)
  – PPP (point-to-point protocol) is used to communicate with the ISP
  – UTD free access: 60 hours of PPP time per month for every student
  – UTD RNA: unlimited use & no busy signal for $6.70/month
  – Commercial ISPs: $16–$25 per month for unlimited PPP time
  – AOL: $20/month for unlimited use

© C. D. Cantrell (04/1999)

HOW TO GET ONTO THE WEB FROM HOME (2)

• Step 2: Get PPP software for your computer from your ISP, or by downloading from the Internet through Infoserv at UTD
  – PC (Windows 3.1): Trumpet Winsock
  – PC (Windows 95/98/NT): Included with OS
  – Mac (System 7/8): Open Transport 1.x, FreePPP
• Step 3: Learn how to use PPP to communicate with your ISP
  – The PPP software can be used to issue text commands to the modem and to the UTD Annex system (for example)
  – Although it’s possible to write a script for Trumpet or FreePPP, it’s best to log in manually at first
  – Write a script later if you want to

© C. D. Cantrell (04/1999)

HOW TO GET ONTO THE WEB FROM HOME (3)

• Step 4: Get Web browser software for your computer from your ISP, with your operating system, or by downloading from the Internet through Infoserv at UTD
  – Netscape Navigator 4.5, or Netscape Communicator 4.5, from http://www.netscape.com/
  – Microsoft Internet Explorer 5.0 comes with Windows 98, or you can get it from http://www.microsoft.com/
• Surf’s up!

© C. D. Cantrell (04/1999)

WHAT’S INSIDE THE CASE: A LOGIC BOARD

© C. D. Cantrell (04/1999)

HOW DID WE GET HERE?

• Where were we, and what sort of future did we see?
  – “Computers in the future may have only 1,000 vacuum tubes and perhaps only weigh 1 1/2 tons.” —Popular Mechanics (1949)
• Where are we?
  – “As an AT&T executive observed last year, the cost of computing has fallen 10 million-fold since the microprocessor was invented in 1971. That’s the equivalent of getting a Boeing 747 for the price of a pizza. If this innovation had been applied to automotive technology, a new car would cost about $2; it would travel at the speed of sound; and it would go 600 miles on a thimble of gas.” —Bob Herbold, COO of Microsoft, in “letter to Ralph Nader” (11/13/97)

© C. D. Cantrell (08/1999)

MARKETS FOR EARLY COMPUTERS

• Scientific market
  – Math tables
• Military market
  – Artillery tables
  – Decryption
  – Thermonuclear weapon design
• Business market
  – Payroll
• Government market
  – Census tabulation

© C. D. Cantrell (01/1999)


A 17th-CENTURY ARTILLERY TABLE

http://info.ox.ac.uk/departments/hooke/geometry/fig16n.gif

THE “MIKE” THERMONUCLEAR DEVICE

Eniwetok Atoll, November 1, 1952

http://www.windows.umich.edu/sun/Solar_interior/Nuclear_Reactions/Fusion/h-bomb.jpg

THE MAIN COMPONENTS OF A COMPUTER

• Computer
  – Processor
    ◦ Control
    ◦ Datapath
  – Memory
  – Peripheral Devices
    ◦ Input
    ◦ Output

© C. D. Cantrell (01/1999)

MECHANICAL CALCULATORS AND COMPUTERS

• 19th century technology: Gears and levers
  – Mechanical inertia ⇒ slow processing speed
  – Unreliable in large systems
• Non-programmable systems
  – Hand-operated calculators
  – Cash registers
    ◦ Large market; National Cash Register Co.
• Programmable systems
  – Charles Babbage’s Difference Engine and Analytical Engine (decimal representation)
  – Konrad Zuse’s Z-1 (binary representation)
  – Howard Aiken’s Mark-I computer at Harvard (instructions on paper tape) (decimal)

© C. D. Cantrell (01/1999)

EARLY CASH REGISTERS

http://www3.ncr.com/history/

BABBAGE’S DIFFERENCE ENGINE

© Prentice-Hall 1995

ELECTROMECHANICAL CALCULATORS AND COMPUTERS

• Early 20th century technology
• Special-purpose systems
  – Motor-driven mechanical calculators
  – Cash registers
  – Enigma cipher machine
• Programmable systems
  – Punched-card sorters
    ◦ Punched cards: Herman Hollerith, 1890’s
    ◦ IBM
  – Machines using relays to construct logic gates
    ◦ Konrad Zuse’s
    ◦ Howard Aiken’s Mark-II (Harvard)
      · Originated the term “bug” for a hardware malfunction

© C. D. Cantrell (01/1999)

HERMAN HOLLERITH’S CARD SORTER

http://turnbull.dcs.st-and.ac.uk/history/Bookpages/Hollerith_machine.jpeg

AN 80-COLUMN PUNCHED CARD

http://turnbull.dcs.st-and.ac.uk/history/Bookpages/Hollerith_machine.jpeg

ELECTROMECHANICAL RELAY SWITCH

THE FIRST COMPUTER BUG

• The first bug was found by Grace Hopper in an electromechanical relay of the Mark-II computer at Harvard

© C. D. Cantrell (01/1999)

THE ENIGMA CIPHER MACHINE

• The Enigma was a rotor-based cipher machine used in World War II by German forces, who thought the Enigma’s code was secure.

© C. D. Cantrell (08/2002)
© Smithsonian Institution 1991

THE “PURPLE” CIPHER MACHINE

• The “purple” cipher machine, which used telephone stepping switches instead of rotors, was used in World War II by the Japanese diplomatic corps. The machine shown was constructed by U.S. cryptanalysts purely from analysis of coded messages.

© C. D. Cantrell (08/2002)

ELECTRONIC COMPUTERS — GENERATION 1

• Dominant technologies:
  – Processors: Vacuum tubes (thermionic cathodes were unreliable)
  – Memory: Acoustic delay line, magnetic drum
• Special-purpose systems
  – John Atanasoff’s experimental machine
  – Colossus (used to decrypt the German Lorenz cipher)
  – Hoffman’s computer (used to decrypt the Japanese “Purple” cipher)
• Programmable systems
  – ENIAC I–IV (J. Presper Eckert, John Mauchly)
• Stored-program systems (John von Neumann)
  – Experimental projects: EDVAC (von Neumann), EDSAC (Maurice Wilkes)

  – Commercial products: UNIVAC I, IBM 701, IBM 704 (floating-point hardware)

© C. D. Cantrell (01/1999)

VACUUM TUBES

Left: A Western Electric triode in its glass envelope
Right: A cutaway view of a pentode, showing the flow of electrons from the cathode (inside) to the plate (outside)

© C. D. Cantrell (04/1999)

THE ENIAC I

STORED-PROGRAM COMPUTERS

• Instructions and data stored as numbers in a common memory
  – Same hardware can be used to load/store instructions and data
  – An instruction must be referenced by its address in memory
  – Programs are software, because they are not hardwired
  – Same hardware can execute many different programs ⇒ general-purpose computers
  – Same memory can hold several different programs, and the processor can switch execution from one to another
  – In principle, permits self-organizing structures
    ◦ This feature rarely used intentionally
    ◦ Self-modification by a program usually ⇒ trouble!
• Formulation of concept, design of EDVAC: John von Neumann

© C. D. Cantrell (01/1999)

THE VON NEUMANN MACHINE (1)

• General-purpose, stored-program computer
  – Instructions and data are stored in the same format in memory: Instructions are data
  – In principle, an executing program can modify itself
    ◦ In practice, this usually means that an error has occurred!
• Strictly sequential execution
  – Instruction fetched from memory, then
  – Instruction decoded, then
  – Data fetched from memory, then
  – Instruction executed, then
  – Result stored to memory

© C. D. Cantrell (01/1999)

ELECTRONIC COMPUTERS — GENERATION 2

• Dominant technologies:
  – Processors: Discrete transistors
  – Memory: Magnetic “cores”
• General-purpose computers
  – IBM System/360
    ◦ Same architecture used for computers varying widely in price and performance
    ◦ Dominated the market for large business computers
  – Minicomputers: Digital Equipment PDP-8
• Supercomputers
  – Control Data 6600
    ◦ Major innovations (load-store architecture, pipeline)

© C. D. Cantrell (01/1999)

THE CONSOLE OF THE CONTROL DATA 6600

ELECTRONIC COMPUTERS — GENERATION 3

• Dominant technologies:
  – Processors: Integrated circuits
  – Memory:
    ◦ Early: Magnetic “cores”
    ◦ Later: Integrated circuits (SRAM, DRAM)
• General-purpose computers
  – IBM System/370
  – Minicomputers: Digital Equipment PDP-11, VAX 11-780
• Supercomputers
  – Control Data 7600
  – Cray-1

© C. D. Cantrell (01/1999)

THE CONTROL DATA 7600

© Cray Research, Inc.

SEYMOUR CRAY

© Associated Press

MEMORY BASED ON FERRITE CORES

http://www.wins.uva.nl/faculteit/museum/core_detail.gif

ELECTRONIC COMPUTERS — GENERATION 4

• Dominant technologies:
  – Processors: Custom LSI or VLSI modules on PC board
  – Memory: SRAM, DRAM
• General-purpose computers
  – IBM 3xxx, 43xx series
  – Minicomputers: Digital Equipment VAX 8400
• Minisupercomputers: Convex C-1, C-2, C-3 series
• Vector supercomputers
  – Cray-2, Cray X-MP, Cray Y-MP
  – Fujitsu, Hitachi, NEC

© C. D. Cantrell (01/1999)

THE CRAY X-MP

• Up to 4 processors • Up to 8M 64-bit words of static RAM memory • Designed by Steve Chen

© Cray Research, Inc.

ELECTRONIC COMPUTERS — GENERATION 5

• Dominant technologies:
  – Processors: Mass-produced microprocessors
  – Memory: SRAM, DRAM
  – Compilers
• RISC processors
  – MIPS (Silicon Graphics)
  – PA-RISC (Hewlett Packard)
  – SPARC (Sun Microsystems)
  – Alpha (Digital Equipment)
  – PowerPC (Apple, Motorola, IBM)
  – Motorola 88000
  – i860, i960 (Intel)
• CISC processors

  – 80x86, Pentium, Pentium Pro, Pentium II & III (Intel)

© C. D. Cantrell (08/1999)

THE RISC REVOLUTION

[Figure: SPECint rating vs. year, 1984–1995. Performance grew at about 1.35× per year through the mid-1980s and about 1.58× per year afterward; processors shown include the MIPS R2000 and R3000, SUN4, IBM Power1 and Power2, HP 9000, and DEC Alpha.]

John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach

THE INTEL 4004 MICROPROCESSOR

THE MIPS R2000 MICROPROCESSOR


LESSONS FROM COMPUTER HISTORY

• Be aware of new technology
  – Opportunities and threats
  – Current skills ⇒ personal mobility
• Minimize your product’s time to market
  – Miss the window of opportunity, and your job may be history
• Use open or widely-adopted standards
  – Compatibility with other vendors’ products; remember the TI PC!
  – Lower development, manufacturing and support costs
  – Respect hardware, OS, networking standards
• Niche markets can be highly profitable
  – CRAY, Apple, Sun, SGI
  – Vulnerable to new technology

© C. D. Cantrell (01/1999)

DESIGN

• Always involves tradeoffs
  – Get the best performance for a given cost
  – Get the lowest cost for a given performance
• Design tools:
  – Computer-aided layout and schematic capture tools
  – Tools to check that design rules are obeyed
  – Tools to check layout vs. schematic
  – Simulation tools to check for faults and to extract parasitics
  – Performance analysis tools

© C. D. Cantrell (09/1999)

PERFORMANCE

• Performance is a measure of how fast something works
• Who cares?
  – Understanding it enables you to get the best deal
    ◦ Which system has the most “bang for the buck”?
  – Performance determines whether a problem is numerically solvable and what methods can be used to solve it
    ◦ Program takes too long to run ⇒ can’t solve problem
  – Leads to understanding of the underlying motivation for a given computer organization
    ◦ Why is one system better than another?
    ◦ Which hardware components are a problem?
    ◦ How does the instruction set affect performance?
    ◦ Why is some hardware better than others for different programs?

© D. Hollenbeck (1/1998)

PERFORMANCE IS TIME

• Transportation performance metrics
  – Throughput
    ◦ Passenger miles per month
    ◦ Performance improves with higher volume and/or distance
  – Passengers (users) measure performance in terms of time in-flight
• Long-distance telecommunication performance metric
  – BL = bits per second (B) times distance (L)
  – Performance improves with higher speed (B) and greater distance (L)
• Computer performance metrics
  – For users: Execution time or response time
  – Transaction processing systems: Throughput

© C. D. Cantrell (09/1998)

A USER’S DEFINITION OF PERFORMANCE

• Execution time := total time required to run program
  – “wall-clock time”
  – Most significant parameter for:
    ◦ Product development
    ◦ Research
• Performance := 1/(execution time)
  – Affected by:
    ◦ Processor speed
    ◦ Concurrency of execution
    ◦ I/O speed
    ◦ Type of program
    ◦ System workload

© C. D. Cantrell (09/1998)

RELATIVE PERFORMANCE

• Relative performance is the performance of one system compared to another, and is given by the performance ratio.
  – The performance ratio of System A to System B is:

    Performance_A / Performance_B = Execution Time_B / Execution Time_A = n

  – System A is n times faster than System B.
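The definition can be expressed as a one-line function (a sketch; the 5 s and 15 s timings are invented for the example):

```python
# n = Performance_A / Performance_B = ExecutionTime_B / ExecutionTime_A
def speedup(exec_time_a, exec_time_b):
    """Return n such that system A is n times faster than system B."""
    return exec_time_b / exec_time_a

# Hypothetical: A takes 5 s, B takes 15 s => A is 3 times faster than B.
print(speedup(5.0, 15.0))   # 3.0
```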

© D. Hollenbeck (1/1998)

EXAMPLE OF COST-PERFORMANCE ANALYSIS (1)

• Consider:
  – Systems S1, S2, with costs C1 and C2
  – Programs P1 and P2, with execution times T1 and T2
• The cost-performance ratio for program Pj on system Si is

    (cost of Si) / (performance on Pj) = Ci × Tj

• A lower value of Ci × Tj means better performance for a given cost, or lower cost for a given performance
• Example:

    System   Ci ($)   T1 (s)   Ci × T1   T2 (s)   Ci × T2
    S1       1,500    10       15,000    5        7,500
    S2       3,000    3        9,000     3        9,000
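The example table can be checked by computing Ci × Tj for each system/program pair (a small sketch using the costs and times from the table):

```python
# Cost-performance ratio C_i * T_j for each pair (lower is better).
systems = {"S1": 1500, "S2": 3000}       # cost C_i in dollars
times = {"S1": (10, 5), "S2": (3, 3)}    # execution times (T_1, T_2) in s

for name in ("S1", "S2"):
    c = systems[name]
    t1, t2 = times[name]
    print(name, c * t1, c * t2)   # S1: 15000 7500 / S2: 9000 9000
```

S2 is the better buy for program P1, while S1 is the better buy for P2, which is the point of the example: cost-performance depends on the workload.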

© C. D. Cantrell (09/1998)

UNIX TIME COMMAND (1)

    wotan% time linpackd.exe
    12.7u 0.1s 0:13 99% 0+764k 1+0io 0pf+0w

• Interpretation of this example:
  – 12.7 seconds user time (u) — time spent running the program linpackd.exe
  – 0.1 seconds system time (s) — time spent by the Unix kernel on behalf of linpackd.exe
  – 13 seconds elapsed time
  – CPU time := user time + system time = 99% of elapsed time
  – Average memory used = 0 kb unshared + 764 kb shared
  – Block I/O (io): 0 blocks input, 1 block output
  – 0 page faults (pf)
  – 0 swaps (w)

© C. D. Cantrell (09/1998)

UNIX TIME COMMAND (2)

    tulia% time linpackd.exe
    8.5u 0.3s 0:13 66% 0+804k 4+0io 41pf+0w

• Interpretation of this example:
  – 8.5 seconds user time
    ◦ tulia’s CPU performance = (12.7/8.5) × wotan’s CPU performance
  – 13 seconds elapsed time
    ◦ tulia and wotan had the same performance on this job
  – tulia’s CPU time for this program = 66% of elapsed time
    ◦ 34% of the elapsed time was spent waiting for other jobs
  – 41 page faults (pf)
    ◦ The hard disk was accessed 41 times for data in virtual memory

© C. D. Cantrell (09/1998)

CPU PERFORMANCE

• CPU performance := 1 / (user CPU time + system CPU time)
  – Gives no information about performance of subprograms
  – Depends strongly on program, compiler options, and workload

© C. D. Cantrell (09/1998)

COMPUTER CZAR’S DEFINITION OF PERFORMANCE

• Throughput := total work done per unit time
• Utilization := % of available time spent executing user jobs
  – Increased by:
    ◦ Having jobs enqueued, awaiting execution
    ◦ Decreasing parallelization (no. of processors used by each job)
• Maximum throughput means:
  – High utilization
  – Low performance from user’s point of view
  – Longer times to develop products or obtain research results

© C. D. Cantrell (09/1998)

CLOCK CYCLES

• Instead of using seconds to measure execution time, often we use clock cycles, aka clock ticks, clock periods, clocks, or cycles.
  – Clock rate (frequency) = cycles per second.
  – Measured in Hertz (1 Hz = 1 cycle/s).
• Clock period is the time between ticks of the clock and is measured in seconds per cycle.
  – Period = 1/frequency
• Example: A 200 MHz (MegaHertz) clock has a clock period of

    1 / (200 × 10^6 Hz) = 5 × 10^-9 seconds = 5 nanoseconds.

• Warning: Some people refer to the clock period as the clock rate; they are not the same thing!
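The worked example, in a few lines:

```python
# Clock period from clock rate: period = 1 / frequency.
freq_hz = 200e6             # 200 MHz clock
period_s = 1 / freq_hz      # seconds per cycle
print(period_s)             # 5e-09 s, i.e. 5 ns
```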

© D. Hollenbeck (1/1998)

FUNDAMENTAL EQUATION FOR CPU TIME (1)

• CPU time required to run a program:

    CPU execution time = (Instructions / Program) × (Clock periods / Instruction) × (Seconds / Clock period)

• Instructions / Program is determined by:
  – The instruction set architecture
  – The compiler
  – The program
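A sketch of the equation as code (the instruction count, CPI, and clock period below are hypothetical, not taken from the text):

```python
# CPU time = (instructions/program) * (clock periods/instruction)
#            * (seconds/clock period)
def cpu_time(instruction_count, cpi, clock_period_s):
    return instruction_count * cpi * clock_period_s

# Hypothetical numbers: 100e6 instructions, CPI 1.6, 5 ns clock period.
print(cpu_time(100e6, 1.6, 5e-9))   # ~0.8 seconds
```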

© C. D. Cantrell (09/1998)

FUNDAMENTAL EQUATION FOR CPU TIME (2)

• Clock periods / Instruction is determined by:
  – Logic design (cost tradeoffs)
  – Instruction mix

• Seconds / Clock period is determined by:
  – Semiconductor device properties
  – Logic design

© C. D. Cantrell (09/1998)

CLOCK PERIODS PER INSTRUCTION (CPI) (1)

• CPI is the average number of clock periods per instruction:

    CPI := (clock periods) / (instruction count) = C / I

  – I = total number of instructions of all types
  – C = total number of clock periods
• Suppose there are N types of instructions
  – Ik = number of instructions of type k; I = Σk Ik
  – CPIk = clock periods to execute one instruction of type k

    C = Σ(k=1..N) Ik × CPIk  ⇒  CPI = C/I = Σ(k=1..N) (Ik/I) × CPIk

    ⇒  CPI = Σ(k=1..N) Fk × CPIk, where Fk = Ik/I = fraction of type k instructions (0 ≤ Fk ≤ 1)

© C. D. Cantrell (09/1998)

CLOCK PERIODS PER INSTRUCTION (CPI) (2)

Example for a MIPS R2000 processor with a particular instruction mix:

    k   Instr. type   Fk     CPIk   Fk × CPIk
    1   Load          0.30   2      0.60
    2   Store         0.15   2      0.30
    3   ALU op.       0.40   1      0.40
    4   Branch        0.15   2      0.30

    CPI = Σ(k=1..4) Fk × CPIk = 1.6
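The table can be checked directly (a sketch using the instruction mix above):

```python
# CPI = sum over instruction types of F_k * CPI_k (the R2000 example mix).
mix = [("load", 0.30, 2), ("store", 0.15, 2),
       ("alu op", 0.40, 1), ("branch", 0.15, 2)]

cpi = sum(f * c for _, f, c in mix)
print(round(cpi, 2))   # 1.6
```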

© C. D. Cantrell (04/1999)

NUMBER OF CPU CLOCK CYCLES

• Two independent methods of computing NC = number of cycles in one program:
  – In terms of execution time and clock frequency: NC = ET × CF
  – In terms of instruction count and clocks per instruction: NC = IC × CPI
  – From the equation ET × CF = IC × CPI one can find any one of ET, CF, IC or CPI, given the other three
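A sketch of using the identity to solve for one quantity given the other three (the numbers below are hypothetical):

```python
# From ET * CF = IC * CPI, any one quantity follows from the other three.
def exec_time(ic, cpi, cf_hz):
    """ET = IC * CPI / CF (seconds)."""
    return ic * cpi / cf_hz

def cpi_of(et_s, cf_hz, ic):
    """CPI = ET * CF / IC."""
    return et_s * cf_hz / ic

# Hypothetical: 200e6 instructions at CPI 2 on a 400 MHz clock.
print(exec_time(200e6, 2, 400e6))   # 1.0 s
```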

© C. D. Cantrell (04/1999)

CLOCK PERIODS PER INSTRUCTION (CPI) (3)

• Computation of CPI using CPU time for two different systems S1, S2 running a program with 100 × 10^6 instructions

    System   Clock rate (MHz)   CPU time (s)   Clock periods (NC)   CPI
    S1       100                10             1000 × 10^6          10
    S2       200                3              600 × 10^6           6
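The last two columns follow from NC = ET × CF and CPI = NC / IC; a short sketch reproducing them:

```python
# Check of the two-system example: NC = ET * CF, then CPI = NC / IC.
ic = 100e6   # 100 x 10^6 instructions

results = {}
for name, cf_hz, et_s in [("S1", 100e6, 10), ("S2", 200e6, 3)]:
    nc = et_s * cf_hz               # total clock periods
    results[name] = (nc, nc / ic)   # (NC, CPI)

print(results)   # S1: 1000e6 periods, CPI 10; S2: 600e6 periods, CPI 6
```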

© C. D. Cantrell (09/1998)

A PEAK PERFORMANCE METRIC: PEAK MIPS (1)

• MIPS := Millions of Instructions Per Second
  – The peak MIPS rating is calculated using the theoretical peak rate at which instructions can be issued
• For average MIPS, use the fundamental performance equation:

    MIPS = Instruction Count / (Execution Time × 10^6)
         = Instruction Count / (Clock Periods × Cycle Time × 10^6)
         = (Instruction Count × Clock Frequency) / (Instruction Count × CPI × 10^6)

• Then

    average MIPS = Clock Frequency / (CPI × 10^6)
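As code (a sketch; the 100 MHz clock and CPI of 1.6 are hypothetical values, not from the text):

```python
# Average MIPS = clock frequency / (CPI * 10^6).
def average_mips(cf_hz, cpi):
    return cf_hz / (cpi * 1e6)

# Hypothetical: 100 MHz clock with an average CPI of 1.6.
print(average_mips(100e6, 1.6))   # ~62.5
```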

© C. D. Cantrell (09/1998)

A PEAK PERFORMANCE METRIC: PEAK MIPS (2)

• MIPS := Millions of Instructions Per Second
  – The peak MIPS rating is calculated using the theoretical peak rate at which instructions can be issued
  – Neglects most real software and hardware properties:
    ◦ Memory-cache or memory-CPU bandwidth
    ◦ I/O speed
    ◦ Type of program (locality of instruction references)
    ◦ System workload
  – Also known as “Meaningless Indicator of Processor Speed”

© C. D. Cantrell (09/1998)

MEGAFLOPS

• MFLOPS := Millions of Floating-Point Operations Per Second
  – Peak (“guaranteed not to exceed”) MFLOPS: Theoretical maximum rate at which floating-point operations can be performed (assuming all FP operations take equal time)
    ◦ Example: Cray Y-MP, 1 processor
      · Clock period = 6 ns
      · 1 addition + 1 multiplication possible on each cycle
      · Peak MFLOPS = 2 × clock frequency = 333 MFLOPS
  – MFLOPS measured by a benchmarking program:
    ◦ A program that can do useful work, e.g., LINPACK
    ◦ A synthetic benchmark, such as:
      · Livermore loops
      · SPEC
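The Cray Y-MP example, checked in a few lines:

```python
# Peak MFLOPS for one Cray Y-MP processor: 2 FP ops per 6 ns cycle.
clock_period_s = 6e-9
flops_per_cycle = 2          # one add + one multiply per cycle
peak_mflops = flops_per_cycle / clock_period_s / 1e6
print(round(peak_mflops))    # 333
```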

© C. D. Cantrell (09/1998)

THE LINPACK BENCHMARK (1)

• LINPACK solves a system of linear equations Ax = b using the Gaussian elimination algorithm
  – Generates a random coefficient matrix A and right-hand side b
  – Timing carried out in the program (not externally):

        call matgen(a,lda,n,b,norma)
        t1 = secnds(0.0)
        call dgefa(a,lda,n,ipvt,info)
        time(1,1) = secnds(0.0) - t1

  – Executes 2n^3/3 + 2n^2 floating-point operations (where matrix A is n × n)
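The operation count can be sketched as a function of n (used, for example, to convert a measured solve time into a MFLOPS figure):

```python
# LINPACK flop count for an n x n system: 2n^3/3 + 2n^2.
def linpack_flops(n):
    return 2 * n**3 / 3 + 2 * n**2

print(linpack_flops(100))    # 100 x 100 benchmark problem
print(linpack_flops(1000))   # 1000 x 1000 benchmark problem
```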

© C. D. Cantrell (09/1998)

THE LINPACK BENCHMARK (2)

• 2 standard benchmark problem sizes:
  – 100 × 100: No code changes allowed
    ◦ Array size so small that pipeline latency is significant
    ◦ May be the most indicative standard benchmark for computationally intensive engineering/scientific programs
  – 1000 × 1000: Any code change is allowed that does not change the problem being solved

© C. D. Cantrell (09/1998)

THE SPEC BENCHMARK

• SPEC := Standard Performance Evaluation Corporation
 Current CPU-intensive benchmark suites:
◦ SPECint95 (integer)
◦ SPECfp95 (floating point)
 Other benchmarks:
◦ SDM: UNIX software development workloads
◦ SFS: system-level file server (NFS) workload
◦ SPEChpc96: high-performance computing benchmarks
 SPEC benchmarks are small “kernels” of code extracted from real programs
◦ May run unrealistically fast
◦ Programs with large arrays that are accessed using a large stride in memory are likely to run much more slowly than the same program with smaller arrays and a smaller stride

[Figure: growth in SPECint rating, 1984–1995, for the SUN4, MIPS R2000, MIPS R3000, IBM Power1, IBM Power2, HP 9000, and DEC Alpha processors; performance grew by about 1.35× per year at first and by about 1.58× per year after the mid-1980s.]

Growth in microprocessor performance since the mid-1980s has been substantially higher than in earlier years.

[Figure: SPECint95 comparison; SPECint95 score vs. processor speed (120–350 MHz) for the PPC 603e, PPC 604, PPC 604e, PPC 740, PPC 750, Pentium, Pentium MMX, Pentium Pro, Pentium II, and Alpha 21164.]

[Figure: SPECfp95 comparison; SPECfp95 score vs. processor speed (120–350 MHz) for the same processors.]

[Figure: SPECint95 comparison; SPECint95 score vs. processor speed (300–667 MHz) for the Pentium II, PPC 604e, PPC 750, Ultra II, Ultra IIi, Alpha 21164a, and Alpha 21264.]

[Figure: SPECfp95 comparison; SPECfp95 score vs. processor speed (300–667 MHz) for the same processors.]

SPEEDUP

• Definition of speedup:

Speedup := (Original Execution Time) / (Improved Execution Time)

• The maximum speedup that can be obtained by using a faster mode of execution is limited by the fraction of the time the faster mode can be used.


AMDAHL’S LAW (1)

• Calculation of speedup due to an enhancement of part of a system or program:

Overall speedup = (Old execution time) / (New execution time)

where

New execution time = (Execution time of unenhanced part) + (Execution time of enhanced part)
                   = (Old execution time) × (1 − Fraction enhanced)
                     + (Old execution time) × (Fraction enhanced) / (Speedup of the enhanced part)
                   = (Old execution time) × [ (1 − Fraction enhanced) + (Fraction enhanced) / (Speedup of the enhanced part) ]


AMDAHL’S LAW (2)

• Overall speedup due to an enhancement:

S_overall = 1 / [ (1 − f) + f / S_enhanced ]

 S_enhanced = speedup of the enhanced part
 f = fraction of the original execution time that can be enhanced
• The maximum speedup that can be obtained by using a faster mode of execution is limited by the fraction of the time the faster mode can be used. Examples include:
 Improving your Web-surfing speed
 Parallelization of computer programs
 Performance of symmetric-multiprocessor systems
 Design of an instruction set architecture
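Amdahl's Law is a one-line function; the values below reproduce the cache examples worked later in these notes (cache 5× faster than main memory, usable 90% and then 50% of the time):

```python
# Amdahl's Law: S_overall = 1 / ((1 - f) + f / S_enhanced), where f is the
# fraction of execution time that can use the enhancement.
def overall_speedup(f, s_enhanced):
    return 1.0 / ((1.0 - f) + f / s_enhanced)

s90 = overall_speedup(0.9, 5)   # cache usable 90% of the time -> ~3.57
s50 = overall_speedup(0.5, 5)   # cache usable 50% of the time -> ~1.67
```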

AMDAHL’S LAW

[Figure: overall speedup S_overall plotted against the fraction enhanced f, for S_enhanced = 10.]

AMDAHL’S LAW (3)

• Example 1: Speedup from the use of cache memory, assuming that
 Cache is 5 times faster than main memory
 Cache can be used 90% of the time
Result:
Overall speedup = 1 / [ (1 − 0.9) + 0.9/5 ] = 3.57
• Example 2: Same as example 1, but assuming that cache can be used only 50% of the time:
Overall speedup = 1 / [ (1 − 0.5) + 0.5/5 ] = 1.67


AMDAHL’S LAW (4)

• Example 3 (Real-world): At the Los Alamos and Lawrence Livermore National Laboratories, 10 years of work by highly skilled programmers resulted in only 70% vectorization on most of the workload
 Vectorized code runs up to 10 times faster than scalar code
 Amdahl’s Law predicts that the average speedup from vectorization is
Overall speedup = 1 / [ (1 − 0.7) + 0.7/10 ] = 2.70
 Result: The “killer micros”, whose scalar speed is equal to or better than the scalar speed of the CRAY vector processors, took over a significant share of CRAY’s market at the national laboratories


BENCHMARKING A PARALLELIZED PROGRAM

• Benchmark results for a program running on one processor:
6629.0u 5.0s 1:51:21 99% 0+0k 0+0io 0pf+0w
• Benchmark results for the same program running in parallel on four processors (same system):
7779.0u 9.0s 32:57 393% 0+0k 0+0io 0pf+0w
• Analysis:
 One processor: User time = 6629 seconds
 Four processors: Sum of user times = 7779 seconds
 Sum of times on 4 CPUs = 393% of elapsed time
 Overall speedup = ratio of elapsed times = 3.38

 Theoretical maximum speedup = S_enhanced = 4
 Actual speedup = S_overall = 3.38 = 84% of theoretical maximum
 Implied fraction enhanced = f = (1 − 1/S_overall) / (1 − 1/S_enhanced) = 0.94
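Working backward from the raw benchmark lines, the speedup and the implied parallel fraction can be checked directly (1:51:21 elapsed on one CPU, 32:57 elapsed on four):

```python
# Elapsed times from the benchmark output.
one_cpu_elapsed = 1 * 3600 + 51 * 60 + 21      # 1:51:21 = 6681 s
four_cpu_elapsed = 32 * 60 + 57                # 32:57   = 1977 s
speedup = one_cpu_elapsed / four_cpu_elapsed   # ~3.38

# Inverting Amdahl's Law for the enhanced fraction:
# f = (1 - 1/S_overall) / (1 - 1/S_enhanced), with S_enhanced = 4 CPUs.
f = (1 - 1 / speedup) / (1 - 1 / 4)
```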

AMDAHL’S LAW (5)

• Example 4 (the RISC revolution): Several different groups “discovered” more or less independently that ∼ 10 instructions account for more than 90% of the executions
 Suppose that one can design hardware that executes almost all of the top 10 instructions in one clock period, instead of 10
 Amdahl’s Law predicts that the overall speedup is
Overall speedup = 1 / [ (1 − 0.9) + 0.9/10 ] = 5.26
 Result: REDUCED INSTRUCTION SET COMPUTER (RISC) architectures such as the MIPS R2000
 General principle: “Make the common case fast”
 When Sun Microsystems changed from Motorola 68020 (CISC) processors to SPARC (RISC) processors at the same clock frequency, performance improved by a factor of ∼ 5

TOP 10 INSTRUCTIONS FOR THE 80x86

Rank   80x86 instruction          % of total integer executions
 1     load                       22%
 2     conditional branch         20%
 3     compare                    16%
 4     store                      12%
 5     add                         8%
 6     and                         6%
 7     sub                         5%
 8     move (register–register)    4%
 9     call                        1%
10     return                      1%
Total for top 10 instructions     95%


PRODUCT DEVELOPMENT (1)

• Performance evaluation as part of the product development process
 Goal: Optimize performance within the cost envelope for your market
◦ Optimize aspects of performance with the greatest effect on sales
 Understand your market!
◦ Customer contacts
◦ Usenet newsgroups
 Make an engineering analysis of costs vs. performance for a range of designs


CONTRIBUTIONS TO PRODUCT COST (1)

• Component costs
• Direct costs – costs directly related to making a product
 Labor
 Purchasing costs
 Warranty service and replacement
 Scrap
• Gross margin – indirect costs that cannot be billed to one product
 R & D (typically 8% to 15% of revenues)
 Marketing & sales
 Equipment maintenance, rental, etc.
 Financing costs, taxes, fees, ...
 Profit


THE DATAPATH

• One of the five basic components of a computer
• All datapath operations are based on one simple idea: Operate repetitively on chunks of data using the same algorithm
 A “chunk” can be a bit, a byte (8 bits), a short word (16 bits), a long word (32 bits), or a quad word (64 bits)
◦ The MIPS instruction set architecture uses words of 32 or 64 bits
• RISC machines load data chunks into registers before operating on them
 The cycle of datapath operations is
◦ Load data from main memory into registers
◦ Operate on the data using the ALU, FPU, etc.
◦ Store the result back to main memory

[Figure: bits 3, 2, 1 and 0 of a data chunk; the same algorithm is applied to all bits in parallel, under a common control signal.]

DATA REPRESENTATIONS

• Harvard architecture, pioneered by Howard Aiken: Separate memories for instructions and data
 Logical extension: Separate memory for each kind of data
◦ Advantage: Data types can be recognized unambiguously
◦ Disadvantage: Inefficient utilization of memory
• von Neumann architecture: A single memory is used for both instructions and data
 All types of data are stored in the same memory
◦ Advantage: Efficient utilization of memory
◦ Disadvantage: Data types cannot be recognized unambiguously
• The correct design decision is generally to use a single kind of memory for all data
 Software has to decide how a particular data item should be interpreted

ON THE INTERPRETATION OF WORDS...

• In a computer’s memory (RAM, CPU registers, etc.) data is organized into words
 A MIPS word is 32 bits long
 A PC word is 16 bits long
 A word is just a string of 1’s and 0’s
 Software determines how the word is interpreted
• Example: 1000 1101 0010 1000 0000 0000 0101 1000
 Integer: 2,368,208,984 (unsigned) or −1,926,758,312 (signed)
 Floating point: −5.177 × 10⁻³¹
 4-character string: a non-printing byte (8D₁₆), then ‘(’, NUL, ‘X’
 Hardware instruction: lw $t0, 88($t1)
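All four readings of that one 32-bit pattern can be reproduced with Python's struct module (big-endian byte order, as on MIPS; Python is illustrative only):

```python
import struct

# One 32-bit pattern, several interpretations.
word = 0b10001101001010000000000001011000   # 0x8D280058
raw = word.to_bytes(4, "big")

as_unsigned = struct.unpack(">I", raw)[0]   # 2,368,208,984
as_signed = struct.unpack(">i", raw)[0]     # -1,926,758,312
as_float = struct.unpack(">f", raw)[0]      # about -5.177e-31
as_bytes = list(raw)                        # 0x8D, '(', NUL, 'X'
```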


REPRESENTATIONS OF INTEGERS (1)

• The Roman representation of numbers: MCMXCVI = 1996
 M = 1000, C = 100, X = 10, V = 5, XC = 90, etc.
 X × C = M, etc. — a nightmare!
• A better idea — positional notation:
 Decimal representation (base 10):
1996₁₀ = 1 × 10³ + 9 × 10² + 9 × 10¹ + 6 × 10⁰
 Each digit multiplies the power of the base (10) that corresponds to the digit’s position, starting with 0 at the right and increasing to the left
 To multiply by the nth power of the base (10 here), shift left n places


UNSIGNED INTEGERS (1)

• Unsigned integer: A representation of a non-negative integer using binary positional notation
• Examples:
 Unsigned byte: 1011 0101₂ = B5₁₆ = 181₁₀
◦ Range of unsigned byte values is 0 to 255₁₀
 Unsigned 32-bit integer:
1111 0111 0011 0001 1011 1001 1111 1100₂ = F7 31 B9 FC₁₆ = 4,147,231,228₁₀
◦ Range of unsigned 32-bit integer values is 0 to 4,294,967,295₁₀
• Used to represent computer addresses, characters, image data


REPRESENTATIONS OF INTEGERS (2)

• Octal representation (base 8): (255)₁₀ = (377)₈ = 3 × 8² + 7 × 8¹ + 7 × 8⁰
 Octal digits are 0, 1, 2, 3, 4, 5, 6, 7
 Convert 196₁₀ to the octal representation:
◦ 64 = 8² < 196 < 8³ = 512, so the highest power of 8 is 8²
◦ 196 ÷ 64 = 3 (remainder = 4)
◦ 1 = 8⁰ < 4 < 8 = 8¹, so the next nonzero digit multiplies 8⁰
◦ 196₁₀ = 3 × 8² + 0 × 8¹ + 4 × 8⁰ = (304)₈
 For more efficient conversion methods, see slides on base conversion
 The basic arithmetic operations (addition, subtraction, multiplication and division) use shifts and carries, just as in the decimal representation


REPRESENTATIONS OF INTEGERS (3)

• Binary representation (base 2):
(255)₁₀ = (11111111)₂ = 1 × 2⁷ + 1 × 2⁶ + 1 × 2⁵ + 1 × 2⁴ + 1 × 2³ + 1 × 2² + 1 × 2¹ + 1 × 2⁰ = 2⁸ − 1
 Binary digits are 0, 1
 Binary digIT = bit
 The binary representation is convenient for logic design (only 2 voltage levels — high and low)
 Binary representations of numbers are inconvenient for humans to write or read (too many digits)

BINARY-TO-DECIMAL CONVERSION OF UNSIGNED 8-BIT INTEGERS

[Figure: the number to be converted, 1 0 1 1 0 0 1 1, written above the powers of 2 (128, 64, 32, 16, 8, 4, 2, 1); the powers beneath the 1 bits are summed.]

result = 128 + 32 + 16 + 2 + 1 = 179


REPRESENTATIONS OF INTEGERS (4)

• ASCII representation of characters as unsigned integers:
C ↔ 0100 0011₂ = 67₁₀
c ↔ 0110 0011₂ = 99₁₀
 A character is stored as 1 byte in computer memory
 In C, characters are 1-byte integers
 Maximum ASCII value is 127₁₀ = 0111 1111₂
 Largest unsigned integer representable with 8 bits is 255₁₀ = 1111 1111₂
 Usually one uses a hexadecimal representation to keep from writing too many digits

REPRESENTATIONS OF INTEGERS (5)

• Hexadecimal digits (base 16):
(0)₁₆ = (0)₁₀ = (0000)₂     (8)₁₆ = (8)₁₀ = (1000)₂
(1)₁₆ = (1)₁₀ = (0001)₂     (9)₁₆ = (9)₁₀ = (1001)₂
(2)₁₆ = (2)₁₀ = (0010)₂     (A)₁₆ = (10)₁₀ = (1010)₂
(3)₁₆ = (3)₁₀ = (0011)₂     (B)₁₆ = (11)₁₀ = (1011)₂
(4)₁₆ = (4)₁₀ = (0100)₂     (C)₁₆ = (12)₁₀ = (1100)₂
(5)₁₆ = (5)₁₀ = (0101)₂     (D)₁₆ = (13)₁₀ = (1101)₂
(6)₁₆ = (6)₁₀ = (0110)₂     (E)₁₆ = (14)₁₀ = (1110)₂
(7)₁₆ = (7)₁₀ = (0111)₂     (F)₁₆ = (15)₁₀ = (1111)₂

REPRESENTATIONS OF INTEGERS (6)

• Hexadecimal representation (base 16):
(3F)₁₆ = (0011 1111)₂ = (63)₁₀
(FF)₁₆ = (1111 1111)₂ = (255)₁₀
 1 hexadecimal digit = 4 bits
 Easy to convert between hexadecimal and binary
 Hexadecimal representations are much more compact than binary representations
 In C, SAL and MIPS assembler, a hexadecimal representation is preceded with 0x (zero followed by lowercase x): 0x3F stands for (3F)₁₆


BASE CONVERSION (1)

• Conversion from binary representation (base 2) to base 2ᵏ
 In base 2ᵏ, there are 2ᵏ different digits (one for every possible pattern of k bits)
 Rule for conversion of the binary representation of any unsigned integer n:
◦ Group the binary digits of n into sets of k bits each, starting at the least significant bit
◦ If necessary, add zero digits at the leftmost (most significant) end of the binary representation of n to make a full set of k bits
◦ Now convert the sets of k bits into digits in base 2ᵏ
 Example: Convert 10110101₂ to base 8 = 2³:
10110101₂ = 010 110 101 = 265₈
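The grouping rule is mechanical enough to sketch directly (Python, for illustration; k = 3 gives octal):

```python
# Convert a binary string to base 2**k by grouping k bits at a time.
# Pad 10110101 on the left to a multiple of 3 bits, then map each
# 3-bit group to one octal digit.
bits = "10110101"
padded = bits.zfill((len(bits) + 2) // 3 * 3)    # "010110101"
octal = "".join(str(int(padded[i:i + 3], 2))     # 010 -> 2, 110 -> 6, 101 -> 5
                for i in range(0, len(padded), 3))
```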


BASE CONVERSION (2)

• Conversion from decimal representation (base 10) to base b
 Let dₖ = coefficient of bᵏ, n = number to be converted, dₗdₗ₋₁…d₀ = base-b expansion of n (to be computed)
 Recursive division algorithm:
◦ l is the integer such that bˡ ≤ n < bˡ⁺¹
◦ Set k = l, and compute dₗ = ⌊n/bˡ⌋ = integer part of n/bˡ
◦ Set n ← n − dₖbᵏ, k ← k − 1, and repeat until k = 0 (the last digit computed is d₀)
 Example: Convert 2310₁₀ to base 16:
◦ 16² = 256 ≤ 2310 < 4096 = 16³ ⇒ l = 2
◦ d₂ = ⌊2310/256⌋ = 9
◦ n ← 2310 − 9 × 256 = 2310 − 2304 = 6 ⇒ d₁ = 0 and d₀ = 6
◦ 2310₁₀ = 906₁₆
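The recursive division algorithm above can be sketched for any base b up to 16 (a minimal illustration, not part of the course tools):

```python
# Recursive division: find the highest power b**l <= n, peel off one
# digit d_k = floor(n / b**k) at each step, and reduce n by d_k * b**k.
DIGITS = "0123456789ABCDEF"

def to_base(n, b):
    if n == 0:
        return "0"
    power = 1
    while power * b <= n:          # find b**l <= n < b**(l+1)
        power *= b
    digits = []
    while power >= 1:
        d, n = divmod(n, power)    # d_k and the reduced n in one step
        digits.append(DIGITS[d])
        power //= b
    return "".join(digits)
```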

ASCII CONTROL CODE CHART

[Chart: the standard ASCII code chart, indexed by bits b7–b5 (columns) and b4–b1 (rows), covering the control codes, symbols, numbers, and upper- and lower-case letters; each cell gives the decimal, hexadecimal and octal codes. Legend: Victor Eijkhout, CSRD / University of Illinois, Champaign/Urbana, Illinois 61801, USA.]

THE ASCII CODE

decimal   hexadecimal   character          description
0         00            NUL                null character
1         01            SOH                start of heading
2         02            STX                start of text
3         03            ETX                end of text
4         04            EOT                end of transmit
5         05            ENQ                enquiry
6         06            ACK                acknowledge
7         07            BEL                bell (alert)
8         08            BS                 backspace
9         09            HT                 horizontal tab
10        0A            LF                 line feed
11        0B            VT                 vertical tab
12        0C            FF                 form feed
13        0D            CR                 carriage return
14        0E            SO                 shift out
15        0F            SI                 shift in
16        10            DLE                data line escape
17        11            DC1                device control 1
18        12            DC2                device control 2
19        13            DC3                device control 3
20        14            DC4                device control 4
21        15            NAK                negative acknowledge
22        16            SYN                synchronous idle
23        17            ETB                end of transmit block
24        18            CAN                cancel
25        19            EM                 end of medium
26        1A            SUB                substitute
27        1B            ESC                escape
28        1C            FS                 file separator
29        1D            GS                 group separator
30        1E            RS                 record separator
31        1F            US                 unit separator
32        20            SPACE              space
33–47     21–2F         !"#$%&’()*+,-./    punctuation
48–57     30–39         0–9                decimal digits
58–64     3A–40         :;<=>?@            punctuation
65–90     41–5A         A–Z                uppercase letters
91–96     5B–60         [\]ˆ_‘             punctuation
97–122    61–7A         a–z                lowercase letters
123–126   7B–7E         {|}˜               punctuation
127       7F            DEL                delete

PRINTABLE ASCII CODES (1)

decimal   hexadecimal   character   description
32        20            SPACE       space
33        21            !           exclamation
34        22            "           double quote
35        23            #           pound sign
36        24            $           dollar sign
37        25            %           percent
38        26            &           ampersand
39        27            ’           right single quote
40        28            (           left parenthesis
41        29            )           right parenthesis
42        2A            *           asterisk
43        2B            +           plus
44        2C            ,           comma
45        2D            -           hyphen
46        2E            .           period
47        2F            /           slash
48        30            0           zero
49        31            1           one
50        32            2           two
51        33            3           three
52        34            4           four
53        35            5           five
54        36            6           six
55        37            7           seven
56        38            8           eight
57        39            9           nine
58        3A            :           colon
59        3B            ;           semicolon
60        3C            <           less than
61        3D            =           equal
62        3E            >           greater than
63        3F            ?           question
64        40            @           at sign

PRINTABLE ASCII CODES (2)

decimal   hexadecimal   character   description
65        41            A           uppercase A
66        42            B           uppercase B
67        43            C           uppercase C
68        44            D           uppercase D
69        45            E           uppercase E
70        46            F           uppercase F
71        47            G           uppercase G
72        48            H           uppercase H
73        49            I           uppercase I
74        4A            J           uppercase J
75        4B            K           uppercase K
76        4C            L           uppercase L
77        4D            M           uppercase M
78        4E            N           uppercase N
79        4F            O           uppercase O
80        50            P           uppercase P
81        51            Q           uppercase Q
82        52            R           uppercase R
83        53            S           uppercase S
84        54            T           uppercase T
85        55            U           uppercase U
86        56            V           uppercase V
87        57            W           uppercase W
88        58            X           uppercase X
89        59            Y           uppercase Y
90        5A            Z           uppercase Z
91        5B            [           left bracket
92        5C            \           backslash
93        5D            ]           right bracket
94        5E            ˆ           caret
95        5F            _           underscore
96        60            ‘           left single quote

PRINTABLE ASCII CODES (3)

decimal   hexadecimal   character   description
97        61            a           lowercase a
98        62            b           lowercase b
99        63            c           lowercase c
100       64            d           lowercase d
101       65            e           lowercase e
102       66            f           lowercase f
103       67            g           lowercase g
104       68            h           lowercase h
105       69            i           lowercase i
106       6A            j           lowercase j
107       6B            k           lowercase k
108       6C            l           lowercase l
109       6D            m           lowercase m
110       6E            n           lowercase n
111       6F            o           lowercase o
112       70            p           lowercase p
113       71            q           lowercase q
114       72            r           lowercase r
115       73            s           lowercase s
116       74            t           lowercase t
117       75            u           lowercase u
118       76            v           lowercase v
119       77            w           lowercase w
120       78            x           lowercase x
121       79            y           lowercase y
122       7A            z           lowercase z
123       7B            {           left brace
124       7C            |           vertical bar
125       7D            }           right brace
126       7E            ˜           tilde
127       7F            DEL         delete


END OF LINE CONVENTIONS

• Conventions for ending a line of text and starting a new line:

Operating system   End of line (ASCII)   C       Hexadecimal
UNIX               LF                    \n      0A
MacOS              CR                    \r      0D
VMS                CR LF                 \r\n    0D 0A
MS-DOS             CR LF                 \r\n    0D 0A


4-BIT UNSIGNED REPRESENTATION

[Figure: the sixteen 4-bit patterns 0000–1111 arranged in a circle, labeled with their unsigned values 0–15; the 4-bit unsigned integer representation wraps around from 15 back to 0.]


MEMORY ADDRESSING

• Logical structure of a computer’s random-access memory (RAM)
 The generic term for the smallest unit of memory that the CPU can read or write is cell
◦ In most modern computers, the size of a cell is 8 bits (1 byte)
 Hardware-accessible units of memory larger than one cell are called words
◦ Currently (1998) the most common word sizes are 32 bits (4 bytes) and 64 bits (8 bytes)
 Every memory cell has a unique integer address
◦ The CPU accesses a cell by giving its address
◦ Addresses of logically adjacent cells differ by 1
◦ The address space of a processor is the range of possible integer addresses, typically 0 to 2ⁿ − 1


BYTE-ADDRESSED MEMORY

[Figure: a byte-addressed memory drawn as a stack of cells, from byte 0 at the bottom to byte 2ⁿ − 1 at the top; each cell holds bits 7 through 0.]


BYTE-ADDRESSED MEMORY

[Figure: the same byte-addressed memory with 32-bit addresses, from byte 00000000 at the bottom to byte FFFFFFFF at the top; each cell holds bits 7 through 0.]

32-bit addressing: The address of a byte is a 4-byte (32-bit) word


WORD ADDRESSING

In a byte-addressed memory, the addresses of successive words differ by the number of bytes in a word:

[Figure: bytes n through n+3 form the word at address n (byte n holds bits 31–24, byte n+1 bits 23–16, byte n+2 bits 15–8, byte n+3 bits 7–0); bytes n+4 through n+7 likewise form the word at address n+4.]


BYTE ORDERING

• Big-endian byte ordering
 Most significant (leftmost) byte has the lowest address
 The address of a word is the address of its most significant byte
 Default byte ordering in MIPS, DEC Alpha, HP PA-RISC and IBM/Motorola/Apple PowerPC architectures
 Only available byte ordering in SPARC and IBM 370 architectures
• Little-endian byte ordering
 Least significant (rightmost) byte has the lowest address
 The address of a word is the address of its least significant byte
 Only available byte ordering in Intel 80x86, National Semiconductor NS 32000 and DEC Vax architectures


BYTE ORDERING CONVENTIONS

[Figure: word 0 drawn as bytes 0–3 under each convention. Big-endian: byte 0 holds bits 31–24 and byte 3 holds bits 7–0. Little-endian: byte 0 holds bits 7–0 and byte 3 holds bits 31–24. The arrows point in the direction of increasingly significant digits.]


LITTLE-ENDIAN vs BIG-ENDIAN BYTE ORDERING

• Affects the interpretation of multi-byte structures (4-byte words, etc.)
• Examples:
 Strings: "MIPS" = 4D 49 50 53 (big-endian) = 53 50 49 4D (little-endian)
◦ But 53 50 49 4D = "SPIM" (big-endian)
 Unsigned 32-bit integer (such as an IP address):
◦ 81 6E 10 53₁₆ = 2,171,474,003₁₀ (big-endian)
◦ But 53 10 6E 81₁₆ = 1,393,585,793₁₀ (little-endian)
• A problem for data transfer from one device to another!

EXAMPLES SHOWING HOW TO READ A SPIM MEMORY DUMP

------

BITS AND BYTES:

1 byte = 8 bits = 2 hexadecimal digits 32 bits = 4 bytes = 8 hexadecimal digits

HEXADECIMAL NOTATION

In the standard C notation 0x12abcdef, the symbol "0x" means "treat what follows as an unsigned integer in the hexadecimal representation". For example, 0x12abcdef denotes the unsigned 32-bit integer that one writes as 313,249,263 in the unsigned decimal representation. 0x12abcdef also denotes 313,249,263 (base 10) in two's complement notation.

BYTE-ADDRESSED MEMORY

Modern microprocessors can address each byte of main memory from address 0x00000000 up to and including the maximum address allowed in the instruction set architecture. The maximum address in a 32-bit architecture is 0xffffffff.

WORD ADDRESSES

In a 32-bit architecture such as the MIPS R2000, words are 32 bits (4 bytes) in length. Addresses in main memory, registers and instructions are all 32 bits in length. Valid word addresses are 0x00000000, 0x00000004, 0x00000008, 0x0000000c (12 in decimal), 0x00000010 (16 in decimal), 0x00000014 (20 in decimal), 0x00000018 (24 in decimal), 0x0000001c (28 in decimal), and every other multiple of 4 up to and including 0xfffffffc.

The address of a 4-byte word is always the address of the byte with the lowest address.

BYTE ORDERING

Because memory is byte-addressed, the order in which the four bytes of a word in memory are assembled to form a numerical constant or an instruction is a matter of convention. Let's consider the following example of addresses and the contents (bytes) stored at each address:

Byte addresses Contents Word addresses

[0x00000007]  0xce  >
[0x00000006]  0x8a  >
[0x00000005]  0x46  >
[0x00000004]  0x02  >  word address 0x00000004
[0x00000003]  0xef  }
[0x00000002]  0xcd  }
[0x00000001]  0xab  }
[0x00000000]  0x12  }  word address 0x00000000

Two conventions are common:

BIG-ENDIAN byte ordering (used in most architectures other than VAX or 80x86): The word at address 0x00000000 in the above example is read as 0x12abcdef. The word at address 0x00000004 is read as 0x02468ace. Note that the most significant ("biggest") byte has the lowest address, and therefore is the byte whose address is the address of the whole word. (Hence the word address is the address of the "big end".)

In this example, the word at address 0x00000000 would be interpreted as representing 313,249,263 (decimal) in the 32-bit two's complement representation. The word at address 0x00000004 would be interpreted as 38,177,486 (decimal) in the 32-bit two's complement representation.

LITTLE-ENDIAN byte ordering (used in the 80x86 architecture, therefore in all PCs): The word at address 0x00000000 in the above example is read as 0xefcdab12. The word at address 0x00000004 is read as 0xce8a4602. Note that the least significant ("littlest") byte has the lowest address, and therefore is the byte whose address is the address of the whole word. (Hence the word address is the address of the "little end".)

In this example, the word at address 0x00000000 would be interpreted as representing -271,733,998 (decimal) in the 32-bit two's complement representation. The word at address 0x00000004 would be interpreted as -829,798,910 (decimal) in the 32-bit two's complement representation.
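Both conventions can be checked against the eight-byte example with Python's struct module (illustrative only; SPIM itself does this for you):

```python
import struct

# The eight bytes of the memory-dump example, from address 0x00000000 up.
memory = bytes([0x12, 0xAB, 0xCD, 0xEF, 0x02, 0x46, 0x8A, 0xCE])

big0, big4 = struct.unpack(">II", memory)          # big-endian reading
little0, little4 = struct.unpack("<II", memory)    # little-endian reading
signed_little0 = struct.unpack("<ii", memory)[0]   # two's complement reading
```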

------

MEMORY DUMPS

Since there are versions of SPIM on both big-endian and little-endian architectures, we give examples for all of the architectures on which students are likely to run SPIM.

BIG-ENDIAN (Motorola 680x0, IBM PowerPC, Sun SPARC, DEC Alpha, HP PA-RISC, MIPS)

Here's the first part of a SPIM display of the data segment of the program display.sal (listed below), with extra annotations to explain what's going on:

DATA
[0x10000000]...[0x1000fffc]  0x00000000
  (range of word addresses; each word contains 0x00000000)

Word addresses:

[0x10010000]  0x43000000  0x00007fff  0x46fffe00  0x48656c6c
  (the address of the first word on this line is in brackets; the remaining words are at addresses 0x10010004, 0x10010008 and 0x1001000c)

[0x10010010]  0x6f2c2077  0x6f726c64  0x21004865  0x6c6c6f2c
  (the address of the first word on this line is in brackets; the remaining words are at addresses 0x10010014, 0x10010018 and 0x1001001c)

Byte addressing examples:

The byte at address 0x1001000d is 0x65. The byte at address 0x1001000e is 0x6c.

Data storage order:

In the source program declare.sal (listed below), the first data declaration reserves a byte (with the ASCII value 'C' and the numerical value 0x43). The rest of the word is 0's, because the next data item declared is an integer, and the next valid word address is 0x10010004. The value of the integer is 32,767 (base 10) = 0x00007fff. The third data item declared is a floating-point number with the value 3.2767 X 10^4 (its IEEE-754 floating-point representation is 0x46fffe00). The next data declaration is for a null-terminated string, "Hello, world!" Note that a null byte, 0x00 at address 0x10010019, immediately follows (i.e., terminates) the bytes of the string. (And so on...)

------

LITTLE-ENDIAN (Intel 80x86):

Here's the first part of a SPIM display of the data segment of the program display.sal (listed below), with extra annotations to explain what's going on:

DATA
[0x10000000]..[0x1000ffff]  0x00000000
  (range of BYTE addresses; each WORD contains 0x00000000)

Word addresses:

[0x10010000]..[0x1001000f]  0x00000043  0x00007fff  0x46fffe00  0x6c6c6548
  (range of byte addresses; the word addresses are 0x10010000, 0x10010004, 0x10010008 and 0x1001000c)

[0x10010010]..[0x1001001f]  0x77202c6f  0x646c726f  0x65480021  0x2c6f6c6c
  (range of byte addresses; the word addresses are 0x10010010, 0x10010014, 0x10010018 and 0x1001001c)

Byte addressing examples:

The byte at address 0x1001000d is 0x6c. The byte at address 0x1001000e is 0x65. Note that the byte ordering is DISPLAYED as reversed only for character (single-byte) and string data. (The words at addresses 0x10010004 and 0x10010008 are numeric data, for which the creators of SPIM wisely decided not to reverse the displayed byte ordering.) Similarly, the assembled MIPS instructions in the text segment are displayed with the same byte ordering on big-endian and little-endian machines.

------

# SAL program declare.sal
# Data segment
         .data
c:       .byte 'C'                # C equivalent: char c; c = 'C';
n:       .word 32767              # C equivalent: int n; n = 32767;
f:       .float 3.2767e4          # C equivalent: float f; f = 3.2767e4;
greet0:  .asciiz "Hello, world!"
greet:   .ascii "Hello, world!"
newline: .byte '\n'
# Text segment (instructions)
         .text
__start:                          # Start of arithmetic & logical operations
         put c                    # C equivalent: printf("%c\n",c);
         put newline
         put n                    # C equivalent: printf("%d\n",n);
         put newline
         put f                    # C equivalent: printf("%6.6f\n",f);
         put newline
         puts greet0              # C equivalent: printf("Hello, world!");
         put newline
         done                     # Return control to the OS


UNSIGNED INTEGERS (2)

• Unsigned integer arithmetic:
 Addition:
      Hex        Binary
       4D      0100 1101
      +49     +0100 1001
       96      1001 0110
 Subtraction:
      Hex        Binary
       4D      0100 1101
      −3F     −0011 1111
       0E      0000 1110

Subtraction can produce negative numbers (covered in later slides)
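The two byte-sized examples above, redone with ordinary integers masked to 8 bits the way the hardware would (Python, for illustration):

```python
# Unsigned 8-bit addition and subtraction: compute with full-width
# integers, then keep only the low 8 bits of the result.
add = (0x4D + 0x49) & 0xFF        # 0x96
sub = (0x4D - 0x3F) & 0xFF        # 0x0E
```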


UNSIGNED INTEGERS (3)

• Integer overflow:
 Occurs when the result of an addition or multiplication has too many digits to be represented in the chosen format; for example,
      Hex         Binary
       FF       1111 1111
      +01      +0000 0001
     1 00     1 0000 0000

 The extra digits are discarded, but the hardware recognizes that an overflow has occurred
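Discarding the extra digit while flagging it can be sketched in Python (the helper name add8 is ours):

```python
# 8-bit unsigned addition that reports overflow, mirroring FF + 01 above.
def add8(a, b):
    total = a + b
    return total & 0xFF, total > 0xFF  # (8-bit result, overflow flag)

result, overflow = add8(0xFF, 0x01)
print(result, overflow)  # 0 True
```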


UNSIGNED INTEGERS (4)

Unsigned integer arithmetic:
 Multiplication:
    Decimal     Hex            Binary
       76       4C                 1001100
      ×49      ×31             ×    110001
      ---      ---             -----------
      684       4C                 1001100
     304       E4                 0000000
     ----      ---               0000000
     3724      E8C              0000000
                               1001100
                              1001100
                              ------------
                              111010001100

. Only shifts and additions are necessary
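The shift-and-add procedure above can be written directly; a Python sketch (function name ours):

```python
# Multiplication by shifts and additions only, as in the 76 x 49 example.
def mul_shift_add(a, b):
    product = 0
    shift = 0
    while b:
        if b & 1:                 # low multiplier bit set: add shifted multiplicand
            product += a << shift
        b >>= 1
        shift += 1
    return product

print(mul_shift_add(76, 49))  # 3724 (= 0xE8C = 0b111010001100)
```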


ADDITIVE INVERSE (1)

The additive inverse of a number n is defined as the number m such that m + n = 0
 Addition of m undoes (inverts) addition of n
 Usually one writes m as m = −n
 Subtraction is the inverse of addition
  ◦ Simplest way to subtract n is to add −n
 Problems for computer architects:
  ◦ How to represent −n for any positive integer n that can be represented in the computer
  ◦ How to maximize speed


COMPARISON OF INTEGER REPRESENTATIONS

[Figure: number lines comparing the unsigned, sign-magnitude, biased, and two's-complement representations of n-bit integers, with M := 2^(n−1) − 1 and N := 2^n − 1; arrows show the direction of increasing numerical order, and negative and positive numbers are color-coded.]


SIGN-MAGNITUDE REPRESENTATION (1)

Bit string: the most significant bit (bit n−1) is the sign bit s; bits n−2 through 0 hold the magnitude m

Value assigned to this bit string in the sign-magnitude representation: k = (−1)^s · m, where m is the unsigned integer represented by bits n−2 through 0
 Major disadvantage: 000...0 represents +0 and 100...0 represents −0
  ◦ Two different representations for 0 would require additional hardware for numerical comparisons
 Used in floating-point representations
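Encoding and decoding in sign-magnitude can be sketched in Python for a fixed width (n = 8 is our assumption; the slide is general, and the function names are ours):

```python
# Sign-magnitude on n = 8 bits: bit 7 is the sign s, bits 6..0 the magnitude m.
N = 8

def sm_encode(k):
    s = 1 if k < 0 else 0
    return (s << (N - 1)) | abs(k)

def sm_decode(u):
    m = u & ((1 << (N - 1)) - 1)
    return -m if (u >> (N - 1)) else m

print(bin(sm_encode(-5)))                            # 0b10000101
print(sm_decode(0b00000000), sm_decode(0b10000000))  # both zero: +0 and -0
```

The last line shows the disadvantage from the slide: two distinct bit patterns decode to zero.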

SIGN-MAGNITUDE REPRESENTATION (2)

Features of the sign-magnitude representation:
 Largest representable positive integer: k_max,+ = 11...1 (n−1 bits) = 2^(n−1) − 1
 Smallest positive integer represented is 00...01 = 1
 Most negative representable integer: k_max,− = −(11...1) (n−1 bits) = −(2^(n−1) − 1)
 Least negative integer less than zero is 100...01 = −1
 Dictionary (lexicographic) ordering: +0 ≺ 1 ≺ ··· ≺ 2^(n−1) − 1 ≺ −0 ≺ −1 ≺ ··· ≺ −(2^(n−1) − 1)

[Figure: 3-bit digitization of an analog signal — a voltage between −2 and +2 is sampled at regular times from 0.0 to 10.0 (sampling times shown in red), and each sample is mapped to one of the 8 digitized values 000 through 111.]

THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

DIGITIZATION OF AN ANALOG SIGNAL

• How should one map a physical quantity that can be either positive or negative to a finite range of unsigned integer values?
 This means that one needs an integer data representation in which unsigned integers can represent either positive or negative values
 One way is to consider one of the non-zero values in the middle of the digital range as representing zero
 Example:
  ◦ With n = 3 bits one can represent 7 = 2^n − 1 signal levels
  ◦ The analog signal is sampled at regular time intervals
  ◦ The most significant bit is on if the sampled voltage is positive
  ◦ Less significant bits correspond to smaller signal ranges
 Digitized sampled values vs. time in graph on previous slide:

ts      0    1    2    3    4    5    6    7    8    9   10
V(ts)  011  101  100  010  011  000  010  110  100  011  011


BIASED REPRESENTATION (1)

In a biased integer representation, the value k assigned to a bit string b_{n−1} ··· b_0 is
   k = −B + Σ_{j=0}^{n−1} b_j 2^j
where the sum is the unsigned integer with binary representation b_{n−1} ··· b_0
 The integer B is called the bias
 Usually B = ⌊(2^n − 1)/2⌋ = 2^(n−1) − 1
 Example (n = 2, B = 1):

    k   (k+B)_10  (k+B)_2
   −1       0        00
    0       1        01
    1       2        10
 A biased representation is used to represent the exponent in the IEEE-754 floating-point representations
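A sketch of encoding and decoding with the usual bias B = 2^(n−1) − 1, shown for n = 8 (the width and function names are our choice):

```python
# Biased representation with n bits and bias B = 2**(n-1) - 1 (n = 8, B = 127).
N = 8
B = (1 << (N - 1)) - 1  # 127

def bias_encode(k):
    return k + B    # unsigned representative u

def bias_decode(u):
    return u - B

print(bias_encode(0))           # 127 = 0b01111111
print(bias_decode(0b11010011))  # 84, as in the biased-127 slide below
```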

4-BIT BIASED REPRESENTATION

[Figure: circular diagram of the 4-bit biased-7 integer representation — the bit patterns 0000 through 1111 arranged around a circle, representing the values −7 through +8. © C. D. Cantrell 1996]


BIASED REPRESENTATION (2)

Addition in a biased integer representation:
 Relation between a signed integer k and its unsigned representative u:
   k = u − B ⇔ u = k + B
  where B is the bias
 Let k_1 = u_1 − B, k_2 = u_2 − B
 Sum: k_1 + k_2 = (u_1 + u_2) − 2B
 The unsigned integer that represents k_1 + k_2 is
   u = k_1 + k_2 + B = u_1 + u_2 − B
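The rule u = u_1 + u_2 − B can be exercised on the 2-bit example from the next slide (function name ours):

```python
# Biased addition: add the unsigned representatives, then subtract the bias.
B = 1   # the 2-bit example's bias

def biased_add(u1, u2):
    return u1 + u2 - B

# k1 = 1 (u1 = 0b10), k2 = -1 (u2 = 0b00): the sum 0 is represented by B = 0b01.
print(bin(biased_add(0b10, 0b00)))  # 0b1
```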


BIASED REPRESENTATION (3)

Example of addition in a biased integer representation:
 Let k_1 = 1, k_2 = −1, B = 1
 The unsigned integers that represent k_1 and k_2 are
   u_1 = 10, u_2 = 00
 Add the unsigned integer representatives of k_1 and k_2, and subtract the bias B = 01:
     10
    +00
    ---
     10
    −01
    ---
     01
 The result, 01, is the unsigned integer that represents 0


BIASED REPRESENTATION (4)

Realistic example of a biased integer representation:
 Let n = number of bits = 8
 To represent equal numbers of positive and negative integers, choose the bias B equal to the integer part of 1/2 the maximum unsigned integer (255) that is representable with 8 bits:
   B = ⌊255/2⌋_10 = 127_10 = 01111111_2
 Numerical value of the integer represented by an 8-bit integer u in the biased-127 representation:
   k = u − B = u − 127


BIASED REPRESENTATION (5)

Biased-127, 8-bit integer representation:
 Example of computation of a numerical value:
   u = 11010011 ⇒ k = u − B = 11010011 − 01111111 = 01010100_2 = 84_10
 Add the biased integer representatives of k_1 = 84 (u_1 = 11010011) and k_2 = −84 (u_2 = 00101011), and subtract the bias B = 01111111:
     11010011
    +00101011
    ---------
     11111110
    −01111111
    ---------
     01111111
 The result, 01111111, is the unsigned integer that represents 0


BIASED REPRESENTATION (6)

Properties of an n-bit biased representation with B = ⌊(2^n − 1)/2⌋ (shown for n = 8):
 The dictionary order of the u's is the same as the numerical order of the k's:
   u = 00000000_2 ≺ 00000001_2 ≺ ··· ≺ 01111111_2 ≺ ··· ≺ 11111111_2
   k =   −127_10     −126_10           0_10              128_10
 Positive values of k correspond to biased, unsigned integers u in which the most significant bit is 1
  ◦ This is the opposite of the sign-magnitude and two's-complement representations, in which k is negative if the most significant bit of u is 1


ONE’S COMPLEMENT

The one's complement of an integer m is defined as the number m′ that results from bitwise negation of the binary representation of m
 Example: m = 11010101 ⇒ m′ = 00101010
 For a representation that uses n bits, m + m′ = 11...1 (n bits) = 2^n − 1
 The one's complement m′ can represent the additive inverse, −m
 In the one's complement representation of signed integers, two different bit patterns represent zero: 00...0 and 11...1 (n bits each)
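Bitwise negation within an n-bit mask gives the one's complement; a Python sketch for n = 8 (function name ours):

```python
# One's complement on n = 8 bits: bitwise negation inside an n-bit mask.
N = 8

def ones_complement(m):
    return ~m & ((1 << N) - 1)

m = 0b11010101
print(bin(ones_complement(m)))              # 0b101010
print(m + ones_complement(m) == 2**N - 1)   # True: m + m' = 11111111
```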


TWO’S COMPLEMENT

• The two's complement of an integer m is defined as the number −m_two that results from finding the one's complement m′ of the binary representation and then adding 1
 Example: m = 00101010
   −m_two = m′ + 1 = 11010101 + 1 = 11010110
 For a representation that uses n bits,
   m + (−m_two) = m + m′ + 1 = 11···11 (n bits) + 1 = 100···00 (n+1 bits) = 2^n
 Dictionary (lexicographic) ordering: 0 ≺ 1 ≺ ··· ≺ 2^(n−1) − 1 ≺ −2^(n−1) ≺ ··· ≺ −1

BINARY-TO-DECIMAL CONVERSION OF TWO'S COMPLEMENT 8-BIT INTEGERS

Example 1: number to be converted →  1  0  1  1  0  0  1  1
           powers of 2 →          −128 64 32 16  8  4  2  1
           result = −128 + 32 + 16 + 2 + 1 = −77

Example 2: number to be converted →  0  0  1  1  0  0  1  1
           powers of 2 →          −128 64 32 16  8  4  2  1
           result = 32 + 16 + 2 + 1 = 51
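Both examples, and the invert-and-add-1 rule, can be checked with a short Python sketch (function names ours; n = 8 as on the slide):

```python
# Two's complement on n = 8 bits: invert and add 1; decoding weights the
# top bit as -2**(n-1), matching the two conversion examples above.
N = 8

def twos_complement(m):
    return (~m + 1) & ((1 << N) - 1)

def to_signed(u):
    return u - (1 << N) if u & (1 << (N - 1)) else u

print(to_signed(0b10110011))             # -77
print(to_signed(0b00110011))             # 51
print(bin(twos_complement(0b00101010)))  # 0b11010110
```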


4-BIT 2’s COMPLEMENT REPRESENTATION

[Figure: circular diagram of the 4-bit two's-complement representation — the bit patterns 0000 through 0111 represent 0 through 7, and 1000 through 1111 represent −8 through −1. © C. D. Cantrell 1996]


TWO’S COMPLEMENT (2)

The two's complement −m_two can represent the additive inverse, −m
 In the two's complement representation of signed integers, only one bit pattern represents zero: 00...0 (n bits)
  ◦ This results in simple hardware for comparing a number to zero
 Computation of the additive inverse involves only two steps:
  ◦ Invert
  ◦ Add 1
 Since the ALU must be able to add in any case, since inversion is fast, and since there is only one representation for zero, the two's complement representation is the best choice for representing signed integers


TWO’S COMPLEMENT (3)

Two's complement representation of an integer i:
   sign of i    two's compl. rep. of i
      +                   i
      −          2^n − |i| = 2^n + i

Example for n = 8: i = −125_10 ⇒ |i| = 7D_16 = 0111 1101_2
   2^n − |i| = 1000 0011_2 = two's complement representation of −125_10

[Flowchart: to obtain the 2's complement representation of n_10 — if n_10 > 0, convert n_10 to base 2 and stop; otherwise convert |n_10| to base 2, then take the 1's complement and add 1.]

[Flowchart: to obtain n_10 from its 2's complement representation — if the sign bit is 0, convert n_two to base 10 and stop; otherwise take the 1's complement and add 1, convert to base 10, and put a − sign in front.]


BASE-64 ENCODING

An encoding method that uses the 64 ASCII characters
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
to represent the 2^6 = 64 possible 6-bit binary strings
 Used to encode binary data for transmission through text-only communication channels (such as many electronic mail programs)
 Uuencoding is more common because it uses bytes (8 bits)
 Method:
  ◦ The least common multiple of 6 and 8 is 24
  ◦ 3 data bytes are used to fill a 24-bit buffer
  ◦ A 24-bit string is encoded as 4 ASCII characters
  ◦ Each 6-bit string represents an unsigned integer k (0 ≤ k ≤ 63)
  ◦ The 6-bit string with value k is represented by the kth ASCII character in the list above (000000 ↦ A, ..., 111111 ↦ /)
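The 3-bytes-to-4-characters step can be demonstrated with Python's standard base64 module (the example input b"Man" is our choice):

```python
import base64

# Three data bytes fill a 24-bit buffer; it is emitted as four characters
# drawn from the 64-character table A-Za-z0-9+/.
data = b"Man"   # 0x4D 0x61 0x6E -> 010011 010110 000101 101110
print(base64.b64encode(data))  # b'TWFu'  (T=19, W=22, F=5, u=46)
```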

[Figure: chip manufacturing flow — a silicon ingot is sliced into blank wafers; 20 to 30 processing steps yield patterned wafers; a die tester marks the good individual dies on each wafer; the dicer cuts the wafer apart; good dies are bonded to packages; packaged dies are tested by the part tester, and tested parts are shipped to customers.]

THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

PATTERNED WAFERS


PACKAGED DIE


LOGICAL VARIABLES

A logical variable can have only two values, which are interpreted as “true” and “false” . Correspondence between bits and logical values: 1 = “true”, 0 = “false” . Correspondences between electrical signals and logical values: Positive logic (an active-high signal): high voltage = “true”, low voltage = “false” Negative logic (an active-low signal): low voltage = “true”, high voltage = “false” A signal set to logical 1 is said to be asserted or active or true A signal set to logical 0 is said to be deasserted or negated or false


LOGIC CIRCUITS

To build a computer, we must have:
 A hardware representation of binary data
  ◦ Bits = signal levels! High or low voltages are mapped to 0 or 1
 Ways to do arithmetic
  ◦ Binary arithmetic = logic!
  ◦ The additive inverse is a logic function (bitwise negation) followed by addition: −m_two = m′ + 1
  ◦ A 1-bit adder can be designed from Boolean logic equations:
     s = a′·b′·c_i + a′·b·c_i′ + a·b′·c_i′ + a·b·c_i
     c_o = a·b + a·c_i + b·c_i
  ◦ To multiply, we just shift and add


LOGIC CIRCUITS (1)

Inputs and outputs are "high" or "low" voltages
Building blocks:
 Combinational logic circuits
  ◦ Output depends only on input
  ◦ No feedback or memory
  ◦ Implement truth tables
  ◦ Examples: AND, OR, NAND, NOR, XOR, XNOR gates
 Sequential logic circuits
  ◦ Have internal state (memory elements or feedback loops)
  ◦ Normally controlled by a clock signal
  ◦ Examples: Flip-flops, registers, DRAM cells


LOGIC GATES (1)

A logic gate is an electronic circuit that implements one of the basic logical functions (AND, OR, NOT)
 Truth tables:
   a b | a·b      a b | a+b      a | a′
   0 0 |  0       0 0 |  0       0 |  1
   0 1 |  0       0 1 |  1       1 |  0
   1 0 |  0       1 0 |  1
   1 1 |  1       1 1 |  1


LOGIC GATES (2)

AND, OR, and NOT can be implemented using
 Mechanical devices
 Mechanical switches
 Electromechanical switches
 Transistors
Why did transistors win out?
 Moore's "law": The number of transistors on a microprocessor die doubles, and the cost per transistor halves, every 18–24 months
  ◦ Dr. Gordon Moore is a co-founder of Intel


MOORE’S LAW (1)

http://www.synopsys.com/news/pubs/present/cicc g1.gif

MOORE’S LAW (2)

http://www.physics.udel.edu/wwwusers/watson/scen103/intel-new.gif

The MOS Transistor

[Figures, after Jan Rabaey, Digital Integrated Circuits (© Prentice-Hall 1995): cross-section of an NMOS transistor (polysilicon gate over gate oxide, n+ source and drain, field oxide with p+ stopper, p-substrate with bulk contact); cross-section of CMOS technology; MOS transistor types and symbols (NMOS enhancement, NMOS depletion, PMOS enhancement); the threshold-voltage concept (a gate-source voltage V_GS inducing an n-channel and depletion region in the p-substrate).]

THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

LOGICAL “AND”

• The AND function has the value "true" when both of its inputs (arguments) are "true"
 Two arguments (inputs) a, b
 One output:
   a b | a AND b
   0 0 |    0
   0 1 |    0
   1 0 |    0
   1 1 |    1
 Boolean notation: AND(a, b) = a · b
  ◦ A logical product, equal to the numerical product of two bits
 An AND gate implements the AND function


IMPLEMENTING “AND” USING SWITCHES

[Figure: two switches A and B in series between +5 V and the output — the output A AND B is high only when both switches are closed.]


TRANSISTORS AS SWITCHES (1)

http://tech-www.informatik.uni-hamburg.de/applets//

TRANSISTORS AS SWITCHES (2)

http://tech-www.informatik.uni-hamburg.de/applets/cmos/

LOGICAL NEGATION

The operation of negation or inversion changes the value of a logical variable
 The name of the negation operator is NOT
 The complement of a logical variable a is the logical variable a′ such that:
  ◦ a′ is true when a is false
  ◦ a′ is false when a is true
 Then: a′ = NOT(a)
 An inverter gate implements 1-bit negation: a → NOT(a) = a′


IMPLEMENTING “NOT” USING SWITCHES

[Figure: a switch A (closed = 1, open = 0) between the output and ground, with a pull-up to +5 V — the output A′ is high exactly when A is open, implementing NOT A.]


IMPLEMENTING “NOT” USING TRANSISTORS (1)

http://tech-www.informatik.uni-hamburg.de/applets/cmos/

IMPLEMENTING “NOT” USING TRANSISTORS (2)

http://tech-www.informatik.uni-hamburg.de/applets/cmos/

IMPLEMENTING “NOT” USING TRANSISTORS (3)

http://tech-www.informatik.uni-hamburg.de/applets/cmos/

IMPLEMENTING “NOT” USING TRANSISTORS (4)

http://www.mentorg.com/dsm/images/Moores law/xcal inside.html

LOGICAL “NAND”

• The NAND function has the value "false" when both of its inputs (arguments) are "true"; otherwise its value is "true"
 Two arguments (inputs) a, b
 One output:
   a b | a NAND b
   0 0 |    1
   0 1 |    1
   1 0 |    1
   1 1 |    0
 Boolean notation: NAND(a, b) = (a · b)′
 A NAND gate implements the NAND function


IMPLEMENTING “NAND” USING TRANSISTORS

http://tech-www.informatik.uni-hamburg.de/applets/cmos/

LOGICAL “OR”

• The OR function has the value "true" when at least one of its inputs (arguments) is "true"
 Two arguments (inputs) a, b
 One output:
   a b | a OR b
   0 0 |   0
   0 1 |   1
   1 0 |   1
   1 1 |   1
 Boolean notation: OR(a, b) = a + b
  ◦ A logical sum, not an arithmetic sum!
 An OR gate implements the OR function


IMPLEMENTING “OR” USING SWITCHES

[Figure: two switches A and B in parallel between +5 V and the output — the output A OR B is high when either switch is closed.]


LOGICAL “NOR”

• The NOR function has the value "false" when at least one of its inputs (arguments) is "true"
 Two arguments (inputs) a, b
 One output:
   a b | a NOR b
   0 0 |    1
   0 1 |    0
   1 0 |    0
   1 1 |    0
 Boolean notation: NOR(a, b) = (a + b)′ (= NOT(a + b))
 A NOR gate implements the NOR function


IMPLEMENTING “NOR” USING TRANSISTORS (1)

http://tech-www.informatik.uni-hamburg.de/applets/cmos/

IMPLEMENTING “NOR” USING TRANSISTORS (2)

http://tech-www.informatik.uni-hamburg.de/applets/cmos/

LOGICAL “XOR”

• The XOR (exclusive OR) function has the value "true" when one and only one of its inputs (arguments) is "true"
 Two arguments (inputs) a, b
 One output:
   a b | a XOR b
   0 0 |    0
   0 1 |    1
   1 0 |    1
   1 1 |    0
 Boolean notation: XOR(a, b) = a ⊕ b
 An XOR gate implements the XOR function


LOGICAL “XNOR”

• The XNOR (or coincidence) function has the value "true" when both of its inputs (arguments) are the same
 Two arguments (inputs) a, b
 One output:
   a b | a XNOR b
   0 0 |    1
   0 1 |    0
   1 0 |    0
   1 1 |    1
 Boolean notation: XNOR(a, b) = (a ⊕ b)′
 An XNOR gate implements the XNOR function
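All of the two-input gates on these slides can be built from AND, OR and NOT alone; a Python sketch (function names ours) that reproduces the truth tables:

```python
# The derived gates, built only from AND, OR and NOT.
def NOT(a):     return 1 - a
def AND(a, b):  return a & b
def OR(a, b):   return a | b
def NAND(a, b): return NOT(AND(a, b))
def NOR(a, b):  return NOT(OR(a, b))
def XOR(a, b):  return OR(AND(a, NOT(b)), AND(NOT(a), b))
def XNOR(a, b): return NOT(XOR(a, b))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, NAND(a, b), NOR(a, b), XOR(a, b), XNOR(a, b))
```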


MULTIPLEXORS (1)

2-to-1 multiplexor:
 3 inputs: s, a, b
 1 output: m
  ◦ If s = 0, m = a
  ◦ If s = 1, m = b
 Truth table:
   s  a           b           | m
   0  0           don't care  | 0
   0  1           don't care  | 1
   1  don't care  0           | 0
   1  don't care  1           | 1

Logic equation: m = s′·a + s·b
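The logic equation translates directly into code; a Python sketch (function name ours):

```python
# 2-to-1 multiplexor implementing m = s'a + sb.
def mux2(s, a, b):
    return (1 - s) & a | s & b

print(mux2(0, 1, 0))  # 1: output follows a when s = 0
print(mux2(1, 1, 0))  # 0: output follows b when s = 1
```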


MULTIPLEXORS (2)

2-to-1 multiplexor
 Selector input is s = 0
 Output is m = a


MULTIPLEXORS (3)

2-to-1 multiplexor
 Selector input is s = 1
 Output is m = b


MULTIPLEXORS (4)

2-to-1 multiplexor
 If selector input is s = 0, then b is blocked
 If selector input is s = 1, then a is blocked


DECODERS (1)

An n-to-2^n decoder is a combinational logic circuit that:
 Has n inputs and 2^n outputs
 Converts from a binary representation of a number m to a logical 1 on output m and logical 0 on all other outputs (0 ≤ m ≤ 2^n − 1)
  ◦ Every string of n bits corresponds to a unique pattern of logical 1's and 0's on the n inputs
  ◦ Every string of n bits is the binary representation of a unique unsigned integer, m
 Example: 2-bit decoder (b_0, b_1 = 1 (asserted) or 0 (deasserted))
  ◦ m = 2b_1 + b_0 (base 10)
  ◦ Logic equations: m_0 = b_1′·b_0′, m_1 = b_1′·b_0, m_2 = b_1·b_0′, m_3 = b_1·b_0
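The 2-bit case can be sketched in Python (function name ours) — exactly one output is asserted for each input pattern:

```python
# 2-to-4 decoder: exactly the output line numbered m = 2*b1 + b0 is asserted.
def decode2(b1, b0):
    m = 2 * b1 + b0
    return [1 if i == m else 0 for i in range(4)]  # [m0, m1, m2, m3]

print(decode2(0, 1))  # [0, 1, 0, 0]: m1 asserted
print(decode2(1, 1))  # [0, 0, 0, 1]: m3 asserted
```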


DECODERS (2)

2-bit decoder
 Inputs are b_0 = 0, b_1 = 0
 Outputs are m_0 = 1, m_1 = 0, m_2 = 0, m_3 = 0


DECODERS (3)

2-bit decoder
 Inputs are b_0 = 1, b_1 = 0
 Outputs are m_0 = 0, m_1 = 1, m_2 = 0, m_3 = 0


DECODERS (4)

2-bit decoder
 Inputs are b_0 = 0, b_1 = 1
 Outputs are m_0 = 0, m_1 = 0, m_2 = 1, m_3 = 0


DECODERS (5)

2-bit decoder
 Inputs are b_0 = 1, b_1 = 1
 Outputs are m_0 = 0, m_1 = 0, m_2 = 0, m_3 = 1


n-BIT DECODER

n-bit decoder
 Address decoding for memory locations, registers or I/O devices
  ◦ Each memory location (or register, or I/O port or device) is assigned a unique n-bit address
  ◦ The CPU or I/O controller accesses the target by sending the address over n parallel signal lines
  ◦ The decoder activates one of 2^n select lines to access the location or device
 Example: SCSI-2 bus
  ◦ Maximum of 8 = 2^3 devices, numbered from 0 to 7
  ◦ The SCSI controller asserts one of 8 lines, depending on the 3-bit address
 Can also be used to generate minterms in logic circuits


APPLICATION OF DECODER TO REGISTER ADDRESSING

[Figure: an n-to-2^n decoder driven by the n-bit register number; each of the 2^n registers (Register 0 through Register 2^n − 1) has its enable input C driven by the AND of the corresponding decoder output and the Write enable signal, and its data input D driven by the Register data lines.]

Each register has an "enable" input (labeled C in the figure)
 A register's enable input must be asserted in order for data to be written to the register through the "data" input (labeled D)
 The enable input is controlled by an AND gate
 Both the signal from the decoder and the "write enable" signal must be asserted in order for the register's enable input to be asserted


TRUTH TABLE FOR 1-BIT HALF ADDER

Half adder:
 2 inputs: a, b
 2 outputs: s (Sum), c_o (CarryOut)
 Truth table:
   a b | c_o s | c_o term | s term
   0 0 |  0  0 |          |
   1 0 |  0  1 |          |  a·b′
   0 1 |  0  1 |          |  a′·b
   1 1 |  1  0 |   a·b    |
 Logic equations from truth table (rows in which output = 1):
   s = a′·b + a·b′
   c_o = a·b
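The equations are just XOR and AND; a Python sketch (function name ours) that reproduces the truth table:

```python
# 1-bit half adder: s = a'b + ab' (XOR), co = ab (AND).
def half_adder(a, b):
    return a ^ b, a & b   # (sum, carry out)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, half_adder(a, b))
```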


1-BIT HALF ADDER (1)

Inputs on left are a = 0, b = 0; inputs on right are a = 1, b = 0


1-BIT HALF ADDER (2)

Inputs on left are a = 0, b = 1; inputs on right are a = 1, b = 1


TRUTH TABLE FOR 1-BIT FULL ADDER

One-bit full adder:
 3 inputs: a, b, c_i (CarryIn — from less significant bit)
 2 outputs: s (Sum), c_o (CarryOut — to more significant bit)
 Truth table:
   a b c_i | c_o s | c_o term   | s term
   0 0  0  |  0  0 |            |
   1 0  0  |  0  1 |            | a·b′·c_i′
   0 1  0  |  0  1 |            | a′·b·c_i′
   1 1  0  |  1  0 | a·b·c_i′   |
   0 0  1  |  0  1 |            | a′·b′·c_i
   1 0  1  |  1  0 | a·b′·c_i   |
   0 1  1  |  1  0 | a′·b·c_i   |
   1 1  1  |  1  1 | a·b·c_i    | a·b·c_i
 Logic equations are on a later slide
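The truth table can be reproduced by composing two half adders and an OR, as the LOGIC CIRCUITS (9) slide later derives; a Python sketch (function name ours):

```python
# 1-bit full adder built from two half adders plus an OR gate.
def full_adder(a, b, ci):
    s0 = a ^ b       # first half adder: add a and b
    c0 = a & b
    s = s0 ^ ci      # second half adder: add the carry in
    c1 = s0 & ci
    return s, c0 | c1

print(full_adder(1, 0, 1))  # (0, 1)
print(full_adder(1, 1, 1))  # (1, 1)
```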

1-BIT FULL ADDER


REAL ADDERS DON’T USE GATES

Gate-level design uses too many transistors . Some operations that need to be performed in order to reproduce the truth table for a gate have to be undone by the transistors in the next gate . For a 1-bit full adder, the gate-level design on the previous slide requires 66 transistors . The transistor-level design on the following slide requires only 28 transistors

FULL ADDER (28 TRANSISTORS)

[Figure: transistor-level CMOS full adder — complementary networks of transistors between VDD and ground compute an intermediate node X, the sum s, and the carry out c_o from the inputs a, b and c_i, using 28 transistors in all.]

s = a′·b′·c_i + a′·b·c_i′ + a·b′·c_i′ + a·b·c_i

c_o = a·b + a·c_i + b·c_i

after Jan Rabaey, Digital Integrated Circuits


LOGIC CIRCUITS (9)

Intuitive approach to constructing a one-bit full adder from one-bit half adders:
 Add the inputs a and b using a half adder
  ◦ Outputs of this stage: s1 = a′·b + a·b′ and c1 = a·b
 Add the CarryIn bit c_i to s1 using another half adder
  ◦ Outputs of this stage:
     s = s1·c_i′ + s1′·c_i = (a′·b + a·b′)·c_i′ + (a′·b + a·b′)′·c_i
     c2 = c_i·(a′·b + a·b′)
 Set the CarryOut bit c_o if either c1 or c2 is set: c_o = c1 + c2


LOGIC CIRCUITS (10)

Logic equations for a one-bit full adder from the truth table:
 4 input combinations give true outputs for Sum (s)
 SOP logic equation for s:
   s = a′·b′·c_i + a′·b·c_i′ + a·b′·c_i′ + a·b·c_i
 4 input combinations give true outputs for CarryOut (c_o)
 SOP logic equation for c_o:
   c_o = a·b + a·c_i + b·c_i


THE ALU

ALU = Arithmetic and Logical Unit
 The ALU is a combinational logic block that executes integer arithmetic and logical instructions
 The simple ALU of Patterson & Hennessy, Chapter 4 supports only the and, or, add, sub and slt instructions
  ◦ We design a 1-bit ALU from OR and AND gates, a full adder, and control circuitry
  ◦ We design a 32-bit ALU using 32 1-bit ALUs in parallel
 Making it fast is not easy because the output of the ALU for bit n depends on the carry signal generated for bit n − 1
 Carry-select, carry-lookahead and carry-forwarding adders are faster, but costly

SIMPLE 1-BIT ALU

[Figure: a 1-bit ALU — data inputs a and b (blue) and CarryIn feed an AND gate (Operation 0), an OR gate (Operation 1) and a full adder (Operation 2); the Operation control signal (red) selects the Result through a multiplexor, and the adder produces CarryOut.]

THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

BIT SLICING

How to design a 32-bit ALU using 32 1-bit ALUs
Concept: Operate on all bits at once using a separate 1-bit ALU for each bit (this slices the word into 32 bits, hence the name)
 This works for and and or
 For add, we have to propagate carries from the least significant bits to the most significant bits

[Figure: four 1-bit ALU slices (BIT 3 through BIT 0) sharing a CONTROL signal, with each slice's carry out c_o feeding the next slice's carry in c_i, adding 1101 and 1001.]

4-BIT RIPPLE CARRY ADDER

thanks to Ulf Dittmer ([email protected])
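The ripple-carry structure can be sketched by chaining the 1-bit full adder (function names and the bit-list convention are ours):

```python
# 4-bit ripple-carry adder: each slice's carry out feeds the next carry in.
def full_adder(a, b, ci):
    return a ^ b ^ ci, (a & b) | (a & ci) | (b & ci)

def ripple_add(a_bits, b_bits, ci=0):  # bit lists, least significant bit first
    sums = []
    for a, b in zip(a_bits, b_bits):
        s, ci = full_adder(a, b, ci)
        sums.append(s)
    return sums, ci

# 1101 + 1001 (LSB first) = 0110 with carry out 1, i.e. 10110
print(ripple_add([1, 0, 1, 1], [1, 0, 0, 1]))
```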

FAST ADDERS (1)

With infinite money, one could just design a circuit that implements a 32-bit adder in 2 levels of logic
 Real n-bit adders are designed to approximate a delay of order log n, instead of a delay of order n
Carry-select adders
 Basic idea: Make two copies of each m-bit adder that is used to make an adder that handles a multiple of m bits, and perform the sum assuming that c_i = 0 in one copy and c_i = 1 in the other copy
 Then select the correct output from each adder using a multiplexor
  ◦ Use the output of the adder that handles the lowest m bits as the control signal for the multiplexor that selects the output for the next m bits
 The carry-select delay is asymptotically of order √n, but a carry-select adder uses a much larger area than a ripple-carry adder


4-BIT CARRY SELECT ADDER

[Figure: 4-bit carry-select adder — two copies of the adder compute the sum of A and B assuming a carry in of 0 and of 1; multiplexors controlled by the actual carry select the correct sum bits and the carry out.]

thanks to Ulf Dittmer ([email protected])

FAST ADDERS (2)

Carry-lookahead adders
 At bit position k, adding the input bits a_k and b_k may:
  ◦ Generate a carry bit if a_k = b_k = 1
  ◦ Propagate a carry bit (c_{o,k} = c_{i,k}) if a_k = 1, or b_k = 1, or both
 Define new Boolean functions called "generate" (g_k) and "propagate" (p_k)
  ◦ Generate: g_k = a_k · b_k
  ◦ Propagate: p_k = a_k + b_k
 Carry in to the next bit, in terms of carry in, generate and propagate:
   c_{i,k+1} = c_{o,k} = g_k + p_k · c_{i,k}
 Simplify notation: c_{k+1} = g_k + p_k · c_k

FAST ADDERS (3)

Carry-lookahead adders
 Recursive equation for c_{k+1}: c_{k+1} = g_k + p_k · c_k
 Solution for a 4-bit adder:
   c_1 = g_0 + p_0·c_0
   c_2 = g_1 + p_1·g_0 + p_1·p_0·c_0
   c_3 = g_2 + p_2·g_1 + p_2·p_1·g_0 + p_2·p_1·p_0·c_0
   c_4 = g_3 + p_3·g_2 + p_3·p_2·g_1 + p_3·p_2·p_1·g_0 + p_3·p_2·p_1·p_0·c_0
 Delay is now (2 gate delays + 1 full adder delay) = 4 gate delays, instead of 4 full adder delays = 8 gate delays
 This approach leads to very complicated equations (and circuits) for more than 4 bits
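A behavioral sketch of the 4-bit case in Python (the loop stands in for the unrolled two-level logic the hardware uses; names are ours):

```python
# 4-bit carry lookahead: all carries come from generate/propagate terms,
# so no carry ripples through the sum logic.
def cla4(a_bits, b_bits, c0=0):   # bit lists, least significant bit first
    g = [a & b for a, b in zip(a_bits, b_bits)]  # generate
    p = [a | b for a, b in zip(a_bits, b_bits)]  # propagate
    c = [c0]
    for k in range(4):
        c.append(g[k] | (p[k] & c[k]))  # c[k+1] = g_k + p_k c_k
    sums = [a ^ b ^ ci for a, b, ci in zip(a_bits, b_bits, c)]
    return sums, c[4]

print(cla4([1, 0, 1, 1], [1, 0, 0, 1]))  # same answer as the ripple-carry adder
```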


FAST ADDERS (4)

Partial carry-lookahead adders:
 Use a 4-bit or 8-bit carry-lookahead adder as a building block
  ◦ We'll use a 4-bit adder
 One 32-bit adder = 8 4-bit adders
  ◦ One 64-bit adder = 4 16-bit adders = 4 blocks of 4 4-bit adders
 At each level of aggregation above 4 bits, we can:
  ◦ Use ripple-carry of lower-level ripple-carry blocks (slowest!)
  ◦ Use ripple-carry of lower-level carry-lookahead blocks
  ◦ Use carry-lookahead of lower-level ripple-carry blocks
  ◦ Use carry-lookahead of lower-level carry-lookahead blocks
 This approach gives a delay of order log_2 n


FAST ADDERS (5)

• Equations for propagate and generate of 4-bit adder blocks:
P0 = p0·p1·p2·p3,  P1 = p4·p5·p6·p7,
P2 = p8·p9·p10·p11,  P3 = p12·p13·p14·p15
G0 = g3 + p3·g2 + p3·p2·g1 + p3·p2·p1·g0
G1 = g7 + p7·g6 + p7·p6·g5 + p7·p6·p5·g4
G2 = g11 + p11·g10 + p11·p10·g9 + p11·p10·p9·g8
G3 = g15 + p15·g14 + p15·p14·g13 + p15·p14·p13·g12
• Equations for carry-out of 4-bit adder blocks:
C1 = G0 + P0·c0
C2 = G1 + P1·G0 + P1·P0·c0
C3 = G2 + P2·G1 + P2·P1·G0 + P2·P1·P0·c0
C4 = G3 + P3·G2 + P3·P2·G1 + P3·P2·P1·G0 + P3·P2·P1·P0·c0

FAST ADDERS (6)

• Example of partial carry-lookahead addition of two 16-bit numbers:
a: 0001 1010 0011 0011
b: 1110 0101 1110 1011

gi: 0000 0000 0010 0011
pi: 1111 1111 1111 1011
Pi: 1110
Gi: 0010
C4: 1

◦ The result C4 = 1 follows from the equation
C4 = G3 + P3·G2 + P3·P2·G1 + P3·P2·P1·G0 + P3·P2·P1·P0·c0
◦ The minterm P3·P2·G1 = 1; hence C4 = 1, i.e., this addition does produce a carry-out in the most significant bit
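The worked example can be replayed in a few lines (a sketch of mine, not part of the slides); the computed bit patterns match the gi, pi, Pi, Gi and C4 values given above:

```python
a = 0b0001_1010_0011_0011
b = 0b1110_0101_1110_1011
g, p = a & b, a | b                    # per-bit generate and propagate

P, G = [], []                          # block signals; index 0 = bits 3..0
for blk in range(4):
    gs = [(g >> (4 * blk + k)) & 1 for k in range(4)]
    ps = [(p >> (4 * blk + k)) & 1 for k in range(4)]
    P.append(ps[0] & ps[1] & ps[2] & ps[3])
    G.append(gs[3] | (ps[3] & gs[2]) | (ps[3] & ps[2] & gs[1])
             | (ps[3] & ps[2] & ps[1] & gs[0]))

# C4 = G3 + P3 G2 + P3 P2 G1 + P3 P2 P1 G0   (c0 = 0, so the last term drops)
C4 = G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1]) | (P[3] & P[2] & P[1] & G[0])

assert f"{g:016b}" == "0000000000100011"           # gi row above
assert f"{p:016b}" == "1111111111111011"           # pi row above
assert [P[3], P[2], P[1], P[0]] == [1, 1, 1, 0]    # Pi = 1110
assert [G[3], G[2], G[1], G[0]] == [0, 0, 1, 0]    # Gi = 0010
assert C4 == 1 == (a + b) >> 16                    # carry out of bit 15
```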


FAST ADDERS (7)

• Delay of a 16-bit partial carry-lookahead adder made up of 4 4-bit carry-lookahead adders:
◦ Let T = (1 gate delay)

◦ Computation of C4 is done in 2 levels of logic in terms of Pi and Gi ⇒ 2T delay
◦ Computation of Pi is done in 1 level of logic in terms of pi ⇒ 1T delay
◦ Computation of Gi is done in 2 levels of logic in terms of pi and gi ⇒ 2T delay
◦ Total: 5T delay for carry-lookahead vs. 32T delay for ripple-carry


SUBTRACTION

How to support 32-bit integer subtraction using only an adder and an inverter:

◦ A 1-bit full adder produces carry-out (co) and sum (s) outputs, given carry-in (ci) and numerical (a and b) inputs
◦ Since (a − b)two = a + (−b)two = a + b′ + 1,
we can subtract using the adder if we:
- Obtain b′ from b using an inverter
- Set the carry-in input (ci) of bit 0 to 1
◦ To select subtraction, we need two control signals:
- BInvert (selects b′ instead of b)
- CarryIn (sets ci = 1 for bit 0)
We can combine these into one control signal, Bnegate
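A quick software model of the invert-and-add trick (a sketch; the function names are my own, and 32-bit registers are modeled with a mask):

```python
MASK = 0xFFFF_FFFF                      # model 32-bit registers

def alu_add(a, b, binvert, carry_in):
    """Adder with the BInvert MUX on the b input."""
    b = (~b & MASK) if binvert else (b & MASK)
    return (a + b + carry_in) & MASK

def sub(a, b):
    # Bnegate asserts both BInvert and the bit-0 CarryIn: a + b' + 1
    return alu_add(a, b, binvert=1, carry_in=1)

assert sub(7, 3) == 4
assert sub(3, 7) == 0xFFFF_FFFC         # twos-complement -4
```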


ALU CONTROL SIGNALS (1)

• Three signals are necessary for the simple ALU of P & H, Chapter 4
◦ Bnegate: Asserts Binvert and CarryIn
◦ Operation (b1 b0): Selects the output signal (00 for and, 01 for or, 10 for add or sub, 11 for slt)

ALU Control Signals
Bnegate  b1 b0  MIPS instruction
   0     0  0   and
   0     0  1   or
   0     1  0   add
   1     1  0   sub
   1     1  1   slt


1-BIT ALU SCHEMATIC DIAGRAM


ALU CONTROL SIGNALS (2)

• Functions of the ALU control signals for 1 bit out of 32 or 64:
◦ The value of b1 b0 determines which device's output is selected:
- b1 b0 = 00 selects the AND output
- b1 b0 = 01 selects the OR output
- b1 b0 = 10 selects the SUM output of the adder
- b1 b0 = 11 selects the Less input
(Less is asserted only in the LSB, and only if the MSB is 1)
◦ Bnegate selects between addition and twos-complement subtraction:
- If Bnegate is deasserted, the output of the adder is a + b
- If Bnegate is asserted, the adder computes a + b′ + 1:
The lower 2–1 MUX selects b′
The upper 2–1 MUX sends the (asserted) Bnegate signal to the carry-in input of the adder (this is the +1)
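The MUX structure just described can be modeled one bit at a time. This is a sketch of my own, not the exact P&H schematic; the slt case is omitted because the Less feedback from the MSB needs extra wiring:

```python
def alu_1bit(a, b, less, carry_in, bnegate, op):
    """One ALU slice: op = b1b0 selects AND, OR, SUM, or the Less input."""
    b = 1 - b if bnegate else b                      # lower 2-1 MUX: b or b'
    s = a ^ b ^ carry_in                             # adder sum output
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)
    return [a & b, a | b, s, less][op], carry_out    # 4-1 output MUX

def alu32(a, b, bnegate, op):
    carry = bnegate                  # Bnegate also drives the bit-0 carry-in
    result = 0
    for i in range(32):
        r, carry = alu_1bit((a >> i) & 1, (b >> i) & 1, 0, carry, bnegate, op)
        result |= r << i
    return result

assert alu32(6, 3, 0, 0b00) == 6 & 3      # and
assert alu32(6, 3, 0, 0b01) == 6 | 3      # or
assert alu32(6, 3, 0, 0b10) == 9          # add
assert alu32(6, 3, 1, 0b10) == 3          # sub
```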


LOGIC CIRCUITS (2)

• Important parameters of a logic circuit:
◦ Cost
- Proportional to number of gates (exclude NOTs; they're cheap)
◦ Power dissipation
- Average power ∝ Cav·VDD²·f·Ng, where Ng = number of gates, Cav = average capacitance per gate, VDD = supply voltage, f = frequency at which gates are switched
◦ Maximum delay
- Proportional to maximum number of gates through which a signal may pass, going from input to output
- Imposes a lower limit on clock period ⇒ upper limit on clock frequency ⇒ upper limit on performance


LOGIC CIRCUITS (3)

• Methods for minimizing gate count:
◦ Karnaugh maps
- Not really useful for more than 5 logical variables
- Academic teaching tool
◦ Boolean algebra
- Industrial solution
- Performed by software
◦ Quine-McCluskey tabular method
- Worst-case time required to minimize a Boolean expression involving n variables is proportional to 2^n


BOOLEAN ALGEBRA (1)

• Algebraic properties of functions of logical variables
◦ Every Boolean function can be implemented by a combinational logic circuit
◦ Every combinational logic circuit implements a Boolean function
• Fundamental Boolean operations are + (OR), · (AND) and NOT
• Approaches:
◦ Axiomatic: algebraic proofs
◦ Practical: truth-table demonstrations


BOOLEAN ALGEBRA (2)

AXIOMS AND BASIC THEOREMS

Description       OR form                       AND form
Axiom 2           x + 0 = x                     x·1 = x
Axiom 3           x + y = y + x                 x·y = y·x
Axiom 4           x·(y + z) = (x·y) + (x·z)     x + y·z = (x + y)·(x + z)
Axiom 5           x + x′ = 1                    x·x′ = 0
Theorem 1         x + x = x                     x·x = x
Theorem 2         x + 1 = 1                     x·0 = 0
Theorem 3         (x′)′ = x
Associativity     x + (y + z) = (x + y) + z     x·(y·z) = (x·y)·z
Absorption        x + x·y = x                   x·(x + y) = x
DeMorgan's laws   (x + y)′ = x′·y′              (x·y)′ = x′ + y′
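Every identity in the table can be checked exhaustively over {0, 1}; this short sketch (mine, not from the slides) writes NOT as 1 − x and checks each row:

```python
from itertools import product

for x, y, z in product((0, 1), repeat=3):
    nx, ny = 1 - x, 1 - y
    assert (x | 0) == x and (x & 1) == x                          # Axiom 2
    assert (x | y) == (y | x) and (x & y) == (y & x)              # Axiom 3
    assert (x & (y | z)) == ((x & y) | (x & z))                   # Axiom 4 (OR form)
    assert (x | (y & z)) == ((x | y) & (x | z))                   # Axiom 4 (AND form)
    assert (x | nx) == 1 and (x & nx) == 0                        # Axiom 5
    assert (x | x) == x and (x & x) == x                          # Theorem 1
    assert (x | 1) == 1 and (x & 0) == 0                          # Theorem 2
    assert (1 - nx) == x                                          # Theorem 3
    assert (x | (x & y)) == x and (x & (x | y)) == x              # Absorption
    assert (1 - (x | y)) == (nx & ny) and (1 - (x & y)) == (nx | ny)  # DeMorgan
print("all identities hold")
```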


BOOLEAN ALGEBRA (3)

• Verification of Axiom 3 (Commutativity):

x  y | x + y | y + x
0  0 |   0   |   0
0  1 |   1   |   1
1  0 |   1   |   1
1  1 |   1   |   1

x  y | x·y | y·x
0  0 |  0  |  0
0  1 |  0  |  0
1  0 |  0  |  0
1  1 |  1  |  1


BOOLEAN ALGEBRA (4)

Illustration of Axiom 3 (Commutativity):

[Gate diagrams: x·y = y·x and x + y = y + x]


BOOLEAN ALGEBRA (5)

• Verification of Axiom 4 (Distributivity of AND over OR):

x  y  z | y + z | x·(y + z) | x·y | x·z | (x·y) + (x·z)
0  0  0 |   0   |     0     |  0  |  0  |      0
0  0  1 |   1   |     0     |  0  |  0  |      0
0  1  0 |   1   |     0     |  0  |  0  |      0
0  1  1 |   1   |     0     |  0  |  0  |      0
1  0  0 |   0   |     0     |  0  |  0  |      0
1  0  1 |   1   |     1     |  0  |  1  |      1
1  1  0 |   1   |     1     |  1  |  0  |      1
1  1  1 |   1   |     1     |  1  |  1  |      1


BOOLEAN ALGEBRA (6)

Illustration of Axiom 4 (Distributivity of AND over OR):

[Gate diagram: x·(y + z) = (x·y) + (x·z)]


BOOLEAN ALGEBRA (7)

• DeMorgan's laws
◦ How to interchange NOT with AND or OR
◦ OR form: (x + y)′ = x′·y′
◦ AND form: (x·y)′ = x′ + y′
◦ Truth table:

x  y | (x + y)′ | x′·y′ | (x·y)′ | x′ + y′
0  0 |    1     |   1   |   1    |    1
0  1 |    0     |   0   |   1    |    1
1  0 |    0     |   0   |   1    |    1
1  1 |    0     |   0   |   0    |    0


BOOLEAN ALGEBRA (8)

• OR form of DeMorgan's laws: (x + y)′ = x′·y′  [gate diagram]

• AND form of DeMorgan's laws: (x·y)′ = x′ + y′  [gate diagram]


BOOLEAN ALGEBRA (9)

• A Boolean function f : {0, 1}^n → {0, 1}
◦ maps n inputs x1, ..., xn to one output y,
◦ takes only the values 0 or 1, and
◦ accepts only inputs that take the values 0 or 1
• Truth-table representation: each row lists the input values x1, x2, ..., xn, the output y = f(x1, ..., xn), and (for rows where y = 1) the corresponding product term; for example, the row 0 0 ... 0 1 contributes the term x1′·x2′···x(n−1)′·xn
• Algebraic representation (for example, as a sum of products): f(x1, ..., xn) is the sum of the product terms of the rows for which y = 1

BOOLEAN ALGEBRA (10)

• A Boolean function that is expressed in algebraic form can be simplified using the axioms and theorems of Boolean algebra
◦ Example 1 (OR form of the Absorption Theorem):
x + x·y = x·1 + x·y = x·(1 + y) = x·1 = x
◦ Example 2:
x + x′·y = x + x·y + x′·y = x·x + x·y + x·x′ + x′·y = (x + x′)·(x + y) = 1·(x + y) = x + y
◦ Example 3:
x·(x + y) = x·x + x·y = x + x·y = x


BOOLEAN ALGEBRA (11)

• The Consensus Theorem: x·y + x′·z + y·z = x·y + x′·z
◦ Intuitively obvious, for if y·z is true, then both y and z are true, and therefore either x·y is true or x′·z is true
◦ For an algebraic proof, introduce x + x′ = 1 into the last term:
x·y + x′·z + y·z = x·y + x′·z + y·z·(x + x′)
= [(x·y) + (x·y)·z] + [(x′·z) + (x′·z)·y]
= x·y + x′·z
(The last line follows from the Absorption Theorem)
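The theorem can also be confirmed by brute force over all eight input rows (a sketch of mine):

```python
from itertools import product

for x, y, z in product((0, 1), repeat=3):
    lhs = (x & y) | ((1 - x) & z) | (y & z)   # x*y + x'*z + y*z
    rhs = (x & y) | ((1 - x) & z)             # x*y + x'*z
    assert lhs == rhs                         # the consensus term y*z is redundant
print("consensus theorem verified on all 8 rows")
```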


LOGIC CIRCUITS (11)

• Logic equations for a one-bit full adder, considered as a combination of two half adders:
◦ Sum (s):
s = (a′·b + a·b′)·ci′ + (a′·b + a·b′)′·ci
= (a′·b + a·b′)·ci′ + (a′·b)′·(a·b′)′·ci
= (a′·b + a·b′)·ci′ + (a + b′)·(a′ + b)·ci
= (a′·b + a·b′)·ci′ + (a·a′ + a·b + a′·b′ + b·b′)·ci
= (a′·b + a·b′)·ci′ + (a·b + a′·b′)·ci
◦ Agrees with the SOP equation for s derived from the truth table


LOGIC CIRCUITS (12)

Logic equations for a one-bit full adder, considered as a combination of two half adders:

◦ For CarryOut (co), use a·b = a·b + a·b·ci (the Absorption Theorem) twice:
co = a′·b·ci + a·b′·ci + a·b
= a′·b·ci + a·b′·ci + a·b + a·b·ci
= a·(b′ + b)·ci + a′·b·ci + a·b
= a·ci + a′·b·ci + a·b
= a·ci + a′·b·ci + a·b + a·b·ci
= a·ci + (a + a′)·b·ci + a·b
= a·ci + b·ci + a·b  (agrees with truth-table result)
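Both simplified equations can be checked against ordinary binary addition, since 2·co + s must equal a + b + ci on every row; a sketch of mine:

```python
from itertools import product

for a, b, ci in product((0, 1), repeat=3):
    na, nb, nci = 1 - a, 1 - b, 1 - ci
    # s = (a'b + ab')ci' + (ab + a'b')ci
    s = (((na & b) | (a & nb)) & nci) | (((a & b) | (na & nb)) & ci)
    # co = a*ci + b*ci + a*b
    co = (a & ci) | (b & ci) | (a & b)
    assert 2 * co + s == a + b + ci
print("full-adder equations verified")
```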


LOGIC CIRCUITS (4)

• Every Boolean function can be implemented with only 2 levels of logic
◦ One level of gates is ANDs
◦ The other level is ORs
◦ If each input is not available in both direct and complemented form, a level of inverters may be necessary
◦ Example: PLA (programmable logic array)


LOGIC CIRCUITS (5)

• Canonical sum of products (SOP) form
◦ In each product (minterm), every variable appears once & only once (either in direct or in complemented form)
◦ 2 levels of logic ⇒ minimal delay
◦ If both direct and complemented inputs are not available, may need another level of logic for negation
• Canonical product of sums (POS) form
◦ In each sum (maxterm), every variable appears once & only once (either in direct or in complemented form)
◦ Less common than SOP form


BOOLEAN ALGEBRA (12)

• A product (multiple AND) x1·x2···xn is called a minterm because it takes the value "true" a minimal number of times (for one and only one set of values of x1, ..., xn)
◦ A minterm of n variables is realized as an AND gate with n inputs
◦ In a sum-of-minterms representation of a Boolean function:
- No. of logical variables in each minterm = no. of inputs to the equivalent AND gate
- No. of terms in the sum = no. of AND gates in the 1st level of logic = no. of inputs to the 2nd-level OR

• A sum (multiple OR) x1 + x2 + ··· + xn is called a maxterm because it takes the value "true" a maximal number of times (whenever at least one of x1, x2, ..., xn is true)


BOOLEAN ALGEBRA (13)

• Truth-table representation of the sum of minterms f(x, y, z) = x′·y·z + x·y′·z′ = Σ m(3, 4):

x  y  z | f(x, y, z) | Label
0  0  0 |     0      |   0
0  0  1 |     0      |   1
0  1  0 |     0      |   2
0  1  1 |     1      |   3
1  0  0 |     1      |   4
1  0  1 |     0      |   5
1  1  0 |     0      |   6
1  1  1 |     0      |   7

◦ Each minterm (row) is labeled by the string of values of x, y, z, considered as an unsigned 3-bit integer
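The table can be generated mechanically from the minterm list; in this sketch (mine) the SOP expression and the minterm set agree on every row:

```python
minterms = {3, 4}                       # f(x, y, z) = sum of m3 and m4
for row in range(8):
    x, y, z = (row >> 2) & 1, (row >> 1) & 1, row & 1
    f_sop = ((1 - x) & y & z) | (x & (1 - y) & (1 - z))   # x'yz + xy'z'
    assert f_sop == (1 if row in minterms else 0)
print("truth table matches the minterm list")
```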


KARNAUGH MAPS (1)

• A Karnaugh map is a pictorial representation of a truth table
◦ Purpose: Make it easier to transform a logic function to a minimal sum of products
◦ Used to:
- Reduce the number of gates in the critical path
- Reduce the number of (and number of inputs to) 1st-level gates
◦ Advantage: Uses humans' ability to interpret visual information
◦ Disadvantages:
- Visual information is not easy for computers to analyze
- Not useful for more than five variables
- Not used in industry
- Essentially an academic teaching tool

Karnaugh map for logical "and", viewed as a bar chart

[3-D bar chart of f(x, y) = x·y over the four (x, y) cells; only the cell x = 1, y = 1 has height 1]

Two-Variable Karnaugh map: Minterms in x, y

        y = 0     y = 1
x = 0 | m0 (00)   m1 (01)
x = 1 | m2 (10)   m3 (11)

Three-Variable Karnaugh map: Minterms in x, y, z

        yz: 00     01        11        10
x = 0 | m0 (000)   m1 (001)  m3 (011)  m2 (010)
x = 1 | m4 (100)   m5 (101)  m7 (111)  m6 (110)

Four-Variable Karnaugh map: Minterms in w, x, y, z

          yz: 00       01          11          10
wx = 00 | m0 (0000)    m1 (0001)   m3 (0011)   m2 (0010)
wx = 01 | m4 (0100)    m5 (0101)   m7 (0111)   m6 (0110)
wx = 11 | m12 (1100)   m13 (1101)  m15 (1111)  m14 (1110)
wx = 10 | m8 (1000)    m9 (1001)   m11 (1011)  m10 (1010)


KARNAUGH MAPS (2)

• Rules for assigning minterms to cells in Karnaugh maps:
◦ Each cell corresponds to one minterm (one AND of all of the logical variables or their complements)
- Example: Cell 3 in a 4-variable Karnaugh map corresponds to the minterm w′·x′·y·z
◦ For each cell and its minterm, the values of the logical variables are written as a binary integer, the bits of which are the values of the logical variables for which the minterm is true
- Example: For cell 3 in a 4-variable Karnaugh map, the binary integer is 0011, corresponding to w = 0, x = 0, y = 1, z = 1
◦ Only one bit changes when one goes to an adjacent cell
- Example: Cells 3 (0011) and 11 (1011) are adjacent
◦ The top and bottom of the map are adjacent
◦ The right and left sides of the map are adjacent
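The adjacency rule is just "cell labels differ in exactly one bit"; a small sketch (mine):

```python
def adjacent(c1, c2):
    """K-map cells are adjacent iff their labels differ in exactly one bit."""
    return bin(c1 ^ c2).count("1") == 1

assert adjacent(0b0011, 0b1011)        # cells 3 and 11, as in the example
assert not adjacent(0b0011, 0b1100)
# the Gray-code column order 00, 01, 11, 10 makes every horizontal step
# (including the wrap-around from 10 back to 00) an adjacency
cols = [0b00, 0b01, 0b11, 0b10]
assert all(adjacent(cols[i], cols[(i + 1) % 4]) for i in range(4))
```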


KARNAUGH MAPS (3)

• Rules for displaying a Boolean function:
◦ Put a 1 in each cell for which the function is true
- Example: For the Boolean function
f(w, x, y, z) = Σ m(4, 5, 12, 13) = w′·x·y′·z′ + w′·x·y′·z + w·x·y′·z′ + w·x·y′·z,
put 1's in the cells numbered 4, 5, 12, and 13 (see next slide), because the function is true when any one of the following minterms is true:
m4 = w′·x·y′·z′,  m5 = w′·x·y′·z,  m12 = w·x·y′·z′,  m13 = w·x·y′·z
◦ Do not put 0's in the cells for which the function is false

Four-Variable Karnaugh map: f(w, x, y, z) = Σ m(4, 5, 12, 13) = x·y′

          yz: 00   01   11   10
wx = 00 |  .    .    .    .
wx = 01 |  1    1    .    .
wx = 11 |  1    1    .    .
wx = 10 |  .    .    .    .


KARNAUGH MAPS (4)

• Rules for finding prime implicants of a Boolean function f:
◦ Circle only cells that contain 1's
- Each circled group of cells corresponds to one implicant of f
- Example: In the preceding slide, the circled cells are 4, 5, 12, and 13, corresponding to the implicant x·y′
◦ The number of cells enclosed in one circle must be a power of 2 (1, 2, 4, 8, or 16)
- Example: In the preceding slide, 2² = 4 cells are circled
◦ To find a prime implicant, circle the maximum number of cells consistent with the above rules
- The more cells a given circle encloses, the smaller the number of logical variables needed to specify the implicant
◦ In the preceding slide, the original sum of minterms 4, 5, 12 and 13 would require 4 4-input ANDs and one 4-input OR
◦ The simplified expression x·y′ requires only one 2-input AND!

KARNAUGH MAPS (5)

• Definitions for Karnaugh maps:
◦ Minimal sum: Expression of a Boolean function f as a sum of products such that
- No sum of products for f has fewer terms
- Any sum of products with the same no. of terms has at least as many logical variables
◦ A Boolean function g implies a Boolean function f if & only if, for every set of input values such that g is 1, f is 1
◦ Prime implicant of a Boolean function f: A product term that implies f, such that if any logical variable is removed from the product, then the resulting product does not imply f
◦ Prime implicant theorem: A minimal sum is a sum of prime implicants
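For small functions these definitions can be applied by brute force. This sketch (mine, with an ad-hoc mask/value encoding of product terms) recovers x·y′ as the unique prime implicant of Σ m(4, 5, 12, 13):

```python
ON = {4, 5, 12, 13}        # minterms of f(w, x, y, z); bit 3 = w ... bit 0 = z

def implies_f(mask, val):
    """A product term (cared-about bits `mask`, required values `val`)
    implies f iff every cell it covers is an ON-set minterm."""
    return all(c in ON for c in range(16) if (c & mask) == val)

implicants = [(m, v) for m in range(1, 16) for v in range(16)
              if (v & ~m & 0xF) == 0 and implies_f(m, v)]

def is_prime(mask, val):
    # prime iff dropping any one literal no longer implies f
    return not any(mask & bit and implies_f(mask & ~bit, val & ~bit)
                   for bit in (1, 2, 4, 8))

primes = [(m, v) for m, v in implicants if is_prime(m, v)]
# mask 0110 = "look at x and y", value 0100 = "x = 1, y = 0", i.e. x*y'
assert primes == [(0b0110, 0b0100)]
```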


KARNAUGH MAPS (6)

• More definitions and theorems for Karnaugh maps:
◦ Distinguished 1-cell: A combination of inputs to a Boolean function that is implied by one & only one prime implicant
◦ Essential prime implicant of a Boolean function f: A prime implicant that covers one or more distinguished 1-cells
◦ Every minimal sum for a Boolean function f must include all of the essential prime implicants of f
◦ Steps in simplifying a Boolean function:
- Determine the distinguished 1-cells and the prime implicants that imply them
- Find a minimal set of prime implicants that imply that f is true for the input combinations that are not implied by essential prime implicants

Three-Variable Karnaugh map: F(x, y, z) = Σ m(0, 2, 3, 7)

        yz: 00   01   11   10
x = 0 |  1    .    1    1
x = 1 |  .    .    1    .

F(x, y, z) = x′·z′ + y·z

KARNAUGH MAPS (7)

• Simplification of the Karnaugh map on the previous slide:
◦ m0 and m7 are the distinguished 1-cells
- The essential prime implicant that implies m0 is x′·z′
- x′·z′ is a prime implicant because removing x′ leaves only z′, which does not imply f, and removing z′ leaves only x′, which also does not imply f
- The essential prime implicant that implies m7 is y·z
◦ Another prime implicant is x′·y, which covers m2 and m3
- This prime implicant is not essential

Four-Variable Karnaugh map: Values of F(w, x, y, z) = Σ m(1, 5, 7, 9, 13, 15)

          yz: 00   01   11   10
wx = 00 |  .    1    .    .
wx = 01 |  .    1    1    .
wx = 11 |  .    1    1    .
wx = 10 |  .    1    .    .

WRONG grouping: y′·z + x·y·z (x·y·z is not a prime implicant!)
RIGHT grouping: y′·z + x·z

KARNAUGH MAPS (8)

• Simplification of the Karnaugh map on the previous slide:
◦ The distinguished 1-cells are m1, m7, m9 and m15
◦ x·y·z is not a prime implicant:
- Removing y leaves x·z, which also implies f
◦ m1 is covered by the prime implicant y′·z
- The implicant x′·y′·z also covers m1, but it is not a prime implicant
◦ In this example, both of the prime implicants are essential

Prime-number selector: Values of F(w, x, y, z) = Σ m(1, 2, 3, 5, 7, 11, 13) = w′·z + w′·x′·y + x′·y·z + x·y′·z

          yz: 00   01   11   10
wx = 00 |  .    1    1    1
wx = 01 |  .    1    1    .
wx = 11 |  .    1    .    .
wx = 10 |  .    .    1    .

KARNAUGH MAPS (9)

• Simplification of the Karnaugh map on the previous slide:
◦ The distinguished 1-cells are m1, m2, m7, m11 and m13
◦ m7 is covered by both w′·z and w′·x·z, but w′·x·z is not a prime implicant
◦ In this example, all of the prime implicants are essential


HAZARDS (1)

• A hazard is a condition in a logically correct digital circuit or computer program that may lead to a logically incorrect output
• Static hazards: Output should stay constant, but doesn't
◦ Static 1 hazard: Output should be a constant 1, but when one input is changed, drops to 0 and then recovers to 1
- Cannot occur in a POS implementation
◦ Static 0 hazard: Output should be a constant 0, but when one input is changed, rises to 1 and then drops back to 0
- Cannot occur in an SOP implementation


HAZARDS (2)

• Why do hazards matter?
◦ The output of a hazard-prone circuit or program depends on conditions other than the inputs and the state
◦ The signal passed to another circuit by a hazard-prone circuit depends on exactly when the output is read
◦ In edge-triggered logic circuits, a momentary glitch resulting from a hazard can be converted into an erroneous output


HAZARDS (3)

• The circuit below for x′·y′ + y·z has a static 1 hazard
◦ If the input y is changed from 0 to 1, control of the output of the OR gate shifts from one AND gate to the other
◦ Any difference in delays between the two AND gates will result in a glitch in the output of the OR


HAZARDS (4)

• The timing diagram below shows the inputs and outputs of a circuit for x′·y′ + y·z with a static 1 hazard


HAZARDS (5)

• Static 1 hazard detection using a Karnaugh map:
◦ Reduce the logic function to a minimal sum of prime implicants
◦ A Karnaugh map that contains adjacent, disjoint prime implicants is subject to a static 1 hazard
- Adjacent prime implicants: Only one variable needs to change value to move from one prime implicant to the other
- Disjoint prime implicants: No prime implicant covers cells of both of the disjoint prime implicants; they correspond to AND gates that must both change their outputs when a particular input is changed
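The hazard can be demonstrated with a toy model (mine, not from the slides) in which the two AND gates of F = x′·y′ + y·z momentarily see different values of y during the 0→1 transition, with x = 0 and z = 1 held constant:

```python
def F(and1_y, and2_y, x=0, z=1):
    """x'y' + yz, letting each AND gate see its own copy of y."""
    return ((1 - x) & (1 - and1_y)) | (and2_y & z)

assert F(0, 0) == 1          # steady state before the transition (y = 0)
assert F(1, 1) == 1          # steady state after the transition  (y = 1)
assert F(1, 0) == 0          # glitch: the x'y' gate has seen y = 1 already,
                             # while the yz gate still sees y = 0

def F_fixed(and1_y, and2_y, x=0, z=1):
    # the consensus term x'z covers the transition and holds the output at 1
    return F(and1_y, and2_y, x, z) | ((1 - x) & z)

assert all(F_fixed(a, b) == 1 for a in (0, 1) for b in (0, 1))
```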


Hazard detection: F(x, y, z) = Σ m(0, 1, 3, 7)

        yz: 00   01   11   10
x = 0 |  1    1    1    .
x = 1 |  .    .    1    .

F(x, y, z) = x′·y′ + y·z


Hazard elimination

        yz: 00   01   11   10
x = 0 |  1    1    1    .
x = 1 |  .    .    1    .

F(x, y, z) = x′·y′ + y·z + x′·z


HAZARDS (6)

• The circuit below for x′·y′ + y·z has no static 1 hazard


HAZARDS (7)

• The timing diagram below shows the inputs and outputs of a revised circuit for x′·y′ + y·z with no static 1 hazard


SYNCHRONOUS vs. ASYNCHRONOUS LOGIC

• Asynchronous logic:
◦ No single master clock signal
◦ Examples:
- Flip-flops, latches
- Buses (clock frequency different from CPU's clock frequency)
◦ In principle, may be faster than synchronous logic
• Synchronous logic:
◦ Controlled by a master clock signal
◦ State changes are correlated & occur at known times
◦ Easier to design & more reliable than asynchronous logic


LIMITS ON CLOCK FREQUENCY (1)

• A clock is a free-running signal with a fixed frequency
fc = 1/Tc,  Tc = clock period
◦ Example: Tc = 2.5 ns ⇒ fc = 400 MHz

[Diagram: state element 1 → combinational logic → state element 2, spanning one clock cycle]


LIMITS ON CLOCK FREQUENCY (2)

• Delays that determine the minimum clock period (maximum clock frequency):
◦ Transport delay through sequential logic blocks, τt
◦ Time for signals to settle in combinational logic blocks, τcomb
◦ Inertial delay (setup time), τsu
◦ Clock skew, τc
◦ Clock period Tc must satisfy
Tc > τt + τcomb + τsu + τc

[Diagram: flip-flop → combinational logic block → flip-flop, with the delays labeled tprop, tcombinational, tsetup along the path]
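With made-up example numbers for the four delay terms (not figures from the slides), the inequality translates directly into a maximum clock frequency:

```python
# hypothetical delay budget, in seconds
tau_t, tau_comb, tau_su, tau_c = 0.2e-9, 1.5e-9, 0.15e-9, 0.15e-9
T_min = tau_t + tau_comb + tau_su + tau_c      # Tc must exceed this sum
f_max = 1.0 / T_min
print(f"T_min = {T_min * 1e9:.2f} ns, f_max = {f_max / 1e6:.0f} MHz")
```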


LIMITS ON CLOCK FREQUENCY (3)

• Clock skew
◦ Caused by the finite signal propagation velocity: the clock arrives at different state elements at different times

[Diagram: the clock arrives at the first flip-flop at time t, the signal passes through a combinational logic block with delay ∆, and the clock arrives at the second flip-flop after t + ∆]


LIMITS ON CLOCK FREQUENCY (4)

• Heat dissipation
◦ On chip
- Danger: Thermal runaway
- Example: Emitter-coupled logic (ECL) is faster than CMOS logic, but dissipates much more heat
- Smaller feature size ⇒ lower operating voltage ⇒ less heat dissipation
◦ On board
- Example: Failure of beta Cray Y-MP floating-point boards under heavy use
• Cost
◦ The fastest parts are usually the scarcest & most expensive

Real World Examples

Chip          Metal    Line     Wafer   Defects   Area    Dies/   Yield   Die
              layers   width    cost    /cm2      (mm2)   wafer           cost
386DX           2      0.90     $900    1.0        43     360     71%     $4
486DX2          3      0.80     $1200   1.0        81     181     54%     $12
PowerPC 601     4      0.80     $1700   1.3       121     115     28%     $53
HP PA 7100      3      0.80     $1300   1.0       196      66     27%     $73
DEC Alpha       3      0.70     $1500   1.2       234      53     19%     $149
SuperSPARC      3      0.70     $1700   1.6       256      48     13%     $272
Pentium         3      0.80     $1500   1.5       296      40      9%     $417

From "Estimating IC Manufacturing Costs,” by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15

© Culler, Patterson 1994 (CS252 slide 3.7)

Other Costs

Die Test Cost = (Test Jig Cost × Ave. Test Time) / Die Yield
Packaging Cost: depends on pins, heat dissipation, beauty, ...

Chip          Die cost   Package          Test &      Total
                         pins   type cost Assembly    cost
386DX         $4         132    QFP  $1    $4         $9
486DX2        $12        168    PGA  $11   $12        $35
PowerPC 601   $53        304    QFP  $3    $21        $77
HP PA 7100    $73        504    PGA  $35   $16        $124
DEC Alpha     $149       431    PGA  $30   $23        $202
SuperSPARC    $272       293    PGA  $20   $34        $326
Pentium       $417       273    PGA  $19   $37        $473
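The arithmetic behind the two tables is simple enough to reproduce; this sketch (mine) checks the 386DX row:

```python
def die_cost(wafer_cost, dies_per_wafer, yield_frac):
    # each wafer yields dies_per_wafer * yield_frac good dies
    return wafer_cost / (dies_per_wafer * yield_frac)

c386 = die_cost(900, 360, 0.71)
assert round(c386) == 4                 # die-cost column: $4
assert round(c386 + 1 + 4) == 9         # + package ($1) + test & assembly ($4) = $9
```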


FLOATING-POINT REPRESENTATIONS

• Fixed-point representation of Avogadro's number:
N0 = 602 000 000 000 000 000 000 000
• A floating-point representation is just scientific notation:
N0 = 6.02 × 10^23
◦ The decimal point "floats": It can represent any power of 10
◦ A floating-point representation is much more economical than a fixed-point representation for very large or very small numbers
• Examples of areas where a floating-point representation is necessary:
◦ Engineering: Electromagnetics, aeronautical engineering
◦ Physics: Semiconductors, elementary particles
• Fixed-point representation is often used in digital signal processing (DSP)


FLOATING-POINT REPRESENTATIONS (2)

• Decimal floating-point representation:
◦ Decimal place-value notation, extended to fractions
◦ Expand a real number r in powers of 10:
r = dn dn−1 ··· d0 . f1 f2 ··· fm
= dn·10^n + dn−1·10^(n−1) + ··· + d0·10^0 + f1/10 + f2/10^2 + ··· + fm/10^m
◦ Scientific notation for the same real number:
r = dn . dn−1 ··· d0 f1 f2 ··· fm × 10^n
◦ Re-label the digits so that
r = f0 . f1 f2 ··· × 10^n = (Σ over k ≥ 0 of fk·10^−k) × 10^n


FLOATING-POINT REPRESENTATIONS (3)

• Binary floating-point representation:
◦ Binary place-value notation, extended to fractions
◦ Expand a real number r in powers of 2:
r = f0 . f1 f2 ··· fm × 2^n = (Σ over k ≥ 0 of fk·2^−k) × 2^n
◦ Example:
1/3 = 1/4 + 1/16 + 1/64 + ··· = 1.0101010···₂ × 2^−2
◦ Normalization: Require that f0 ≠ 0
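The expansion can be computed by repeated doubling (the integer part after each doubling is the next binary digit); a sketch of mine, using exact rationals:

```python
from fractions import Fraction

def binary_digits(r, bits=12):
    """First `bits` binary digits of a fraction 0 <= r < 1."""
    out = []
    for _ in range(bits):
        r *= 2
        out.append(int(r))      # next digit is the integer part
        r -= int(r)
    return "".join(map(str, out))

assert binary_digits(Fraction(1, 3)) == "010101010101"   # 1/3 = 0.010101...2
```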


FLOATING-POINT REPRESENTATIONS (4)

• Important binary floating-point numbers:
◦ One: 1.0 = 1.00···₂ × 2^0
◦ Two: 1.111···₂ × 2^0 = 1 + 1/2 + 1/4 + 1/8 + ···
= 1/(1 − 1/2)  (sum of a geometric series)
= 2 = 1.000···₂ × 2^1
◦ Rule: 1.b1 b2 ··· bk 111··· = 1.b1 b2 ··· (bk + 1) 000···
(note that the sum bk + 1 may generate a carry bit)


FLOATING-POINT REPRESENTATIONS (5)

• A more complicated example:
◦ Obtain the binary floating-point representation of the number 2.6₁₀
◦ Expand in powers of 2:
2.6 = 2 + 3/5 = 1×2^1 + 0×2^0 + 3/5
where (see later slides for a more efficient method)
3/5 = 1/2 + 1/16 + 1/32 + 1/256 + 1/512 + ···
= 0.10011001100···₂ × 2^0
◦ Hence 2.6₁₀ = 1.010011001100···₂ × 2^1

INTEGER OPERATIONS ON FLOATING-POINT NUMBERS

• Requirements for optimization of important operations:
◦ Use existing integer operations to test the sign of, or compare, FPNs
◦ Sign must be shown by the most significant bit
◦ Lexicographic order of exponents = numerical order ⇒ biased representation


FLOATING-POINT REPRESENTATIONS (6)

• Bit sequence (bit n−1 down to bit 0):
| s | e | f |

• 2 conventions for assigning a value to this bit string:
r = (−1)^s × 2^(e−B) × 0.f   or   r = (−1)^s × 2^(e−B) × 1.f
◦ s is the sign bit, e is the exponent field, B is the bias, f is the fraction or mantissa, and the extra 1 (if any) is the implicit 1
◦ The exponent is represented in biased format
◦ The bits of the fraction are interpreted as the coefficients of powers of 1/2, in place-value notation


FLOATING-POINT REPRESENTATIONS (7)

• IEEE-754 single precision format:

31 | 30 ... 23 | 22 ... 0
s  |     e     |    f

• Numerical value assigned to this 32-bit word, interpreted as a floating-point number:
r = (−1)^s × 2^(e−127) × 1.f
◦ e is the exponent, interpreted as an unsigned integer (0 < e < 255 for normalized numbers)


FLOATING-POINT REPRESENTATIONS (8)

• IEEE-754 double precision format uses two consecutive 32-bit words:

31 | 30 ... 20 | 19 ... 0
s  |     e     | f (high bits)
31 ... 0
f (low bits)

• Numerical value assigned to this 64-bit quantity, interpreted as a floating-point number:
r = (−1)^s × 2^(e−1023) × 1.f
◦ e is the exponent, interpreted as an unsigned integer (0 < e < 2047 for normalized numbers)

• For comparison, in the single-precision format described in another source (see its Figure 5-8), the floating-point number is represented by an 8-bit exponent field (e) and a 2s-complement 24-bit mantissa field (man) with an implied most significant nonsign bit:

31 ... 24 | 23   | 22 ... 0
Exponent  | Sign | Fraction
(the sign and fraction together form the mantissa)

• Operations are performed with an implied binary point between bits 23 and 22. When the implied most significant nonsign bit is made explicit, it is located to the immediate left of the binary point. The floating-point number x is given by:
x = 01.f × 2^e  if s = 0
x = 10.f × 2^e  if s = 1
x = 0           if e = −128

FLOATING-POINT REPRESENTATIONS (9)

Find the numerical value of the floating-point number with the IEEE-754 single-precision representation 0x46fffe00:

  s = 0, e = 10001101 (base 2) = 0x8d, f = 11111111111111000000000 (base 2)

. Value of exponent = 0x8d - B = 141 - 127 = 14
. Value of fraction: 1.f = 1 + 11111111111111000000000 (base 2) / 2^23 = 2 - 2^-14
. Value of number = 2^14 × (2 - 2^-14) = 2^15 - 1 = 0x7fff = 32767 (decimal)
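The worked example can be checked mechanically; a sketch in Python (its standard struct module assumed) that decodes the same bit pattern:

```python
import struct

# Decode the single-precision pattern from the worked example above.
(value,) = struct.unpack(">f", (0x46fffe00).to_bytes(4, "big"))
assert value == 32767.0   # 2^14 * (2 - 2^-14) = 2^15 - 1
```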

FLOATING-POINT REPRESENTATIONS (10)

Find the numerical value of the floating-point number with the IEEE-754 double-precision representation 0x40dfffc0 00000000:

  s = 0, e = 10000001101 (base 2) = 0x40d, f (high bits) = 11111111111111000000 (second word is all 0's)

. Value of exponent = 0x40d - B = 1037 - 1023 = 14
. Value of fraction: 1.f = 1 + 1111111111111100...0 (base 2) / 2^52 = 2 - 2^-14
. Value of number = 2^14 × (2 - 2^-14) = 2^15 - 1 = 0x7fff = 32767 (decimal)

FLOATING-POINT REPRESENTATIONS (11)

Conversion of a number r from the decimal representation to the IEEE-754 single-precision binary representation:
1. If r < 0, perform the following steps with |r| and change the sign at the end
2. Find the base-2 exponent k such that 2^k <= r < 2^(k+1)
3. Compute e = k + B, where B = 127 for single precision and B = 1023 for double precision, and express e in base 2
4. Compute 1.f = r / 2^k; check that 1 <= 1.f < 2
5. Expand 0.f = 1.f - 1 as a binary fraction b_(p-2)/2 + b_(p-3)/2^2 + ... + b_0/2^(p-1), where p = 24 for single precision and p = 53 for double precision. Then f = b_(p-2) b_(p-3) ... b_0.

FLOATING-POINT REPRESENTATIONS (12)

Convert -3.25 from the decimal representation to the IEEE-754 single-precision binary representation (see next slide for the best method):
1. Since -3.25 < 0, we work with 3.25 and change the sign at the end
2. Since 2^1 <= 3.25 < 2^2, the unbiased exponent is k = 1
3. Compute e = k + B = 1 + 127 = 128 = 10000000 (base 2)
4. Compute 1.f = 3.25 / 2^1 = 1.625
5. Expand 1.625 - 1 = 0.625 = 1/2 + 0/2^2 + 1/2^3. Then f = 10100000000000000000000 (base 2).

Result: s = 1, e = 10000000, f = 10100000000000000000000; hex representation 0xc0500000
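The hand conversion above can be confirmed by packing -3.25 into its raw bit pattern (Python with the standard struct module assumed):

```python
import struct

# Pack -3.25 and compare with the hand conversion above.
(word,) = struct.unpack(">I", struct.pack(">f", -3.25))
assert word == 0xc0500000
```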

FLOATING-POINT REPRESENTATIONS (12a)

The most efficient method for conversion of the fraction from base 10 to base 2:
1. Let 0.f (base 2) = 0.d_1 d_2 ... d_k ..., where d_k multiplies 2^-k
2. Set F = the fraction in base 10 (the digits after the decimal point)
3. Set k = 1
4. Compute d_k = floor(2F) = the integer part of 2F
5. Replace F with the fractional part of 2F (the part after the decimal point)
6. Replace k with k + 1
7. If you have computed > p bits of f, or if F = 0, stop. Otherwise go to step 4 and continue.
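The repeated-doubling steps above can be sketched in Python (the function name frac_to_binary is illustrative):

```python
def frac_to_binary(F, p=24):
    """Convert a decimal fraction 0 <= F < 1 to up to p binary digits
    by the repeated-doubling steps above."""
    bits = []
    while F != 0 and len(bits) < p:   # stop after p bits, or when F = 0
        F *= 2.0
        d = int(F)        # d_k = integer part of 2F
        bits.append(d)
        F -= d            # keep the fractional part of 2F
    return bits

assert frac_to_binary(0.625) == [1, 0, 1]   # 0.625 = 0.101 (base 2)
```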


FLOATING-POINT REPRESENTATIONS (13)

Floating-point addition:
1. Assume that both summands are positive and that r' <= r
2. Find the difference of the exponents, e - e' >= 0
3. Set bit 23 and clear bits 31-24 in both r and r':
   r = r & 0x00ffffff; r = r | 0x00800000;
4. Shift r' right e - e' places (to align its binary point with that of r)
5. Add r and r' (shifted) as unsigned integers: u = r + r'
6. Compute t = u >> 24 (t = 1 if the sum carried out of bit 23, else t = 0)
7. Normalization: shift u right t places and compute e = e + t
8. Compute f = u - 0x00800000; f is the fraction of the sum.
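A sketch of the addition steps above, applied to the raw bit patterns (Python with struct assumed; function names are illustrative; positive normalized operands with r' <= r, and no rounding of the shifted-out bits):

```python
import struct

def bits(x):
    """Raw 32-bit IEEE-754 single-precision pattern of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def fp_add(r, rp):
    """Add two positive normalized floats by the steps above,
    assuming rp <= r (so only rp needs alignment)."""
    w, wp = bits(r), bits(rp)
    e, ep = (w >> 23) & 0xff, (wp >> 23) & 0xff
    u1 = (w & 0x007fffff) | 0x00800000   # make the implicit 1 explicit
    u2 = (wp & 0x007fffff) | 0x00800000
    u = u1 + (u2 >> (e - ep))            # align binary points, add as integers
    t = u >> 24                          # 1 if the sum carried out of bit 23
    u >>= t                              # normalize
    e += t
    f = u - 0x00800000                   # remove the implicit 1 again
    word = (e << 23) | f
    return struct.unpack(">f", word.to_bytes(4, "big"))[0]

assert fp_add(3.25, 0.25) == 3.5
assert fp_add(1.0, 1.0) == 2.0
```

Because the bits shifted out of rp are simply discarded, this sketch truncates rather than rounds.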


FLOATING-POINT REPRESENTATIONS (14a)

Example of floating-point addition: compute r + rp, where r = 3.25 = 0x40500000 and rp = 0.25 = 0x3e800000
1. Both summands are positive and rp = 0.25 < r = 3.25
2. The difference of the exponents is 1 - (-2) = 3
3. Copy bits 22-0, clear bits 31-24 and set bit 23 in both r and rp:
   a1 = r & 0x007fffff = 0x00500000; u1 = a1 | 0x00800000 = 0x00d00000
   a2 = rp & 0x007fffff = 0x00000000; u2 = a2 | 0x00800000 = 0x00800000

FLOATING-POINT REPRESENTATIONS (14b)

Example of floating-point addition: 3.25 + 0.25 (continued)
4. Shift u2 right e - e' = 3 places to align its binary point: u2 (shifted) = 0x00100000
5. Add u1 and u2 (shifted) as unsigned integers: u = u1 + u2 (shifted) = 0x00e00000
6. Compute t = u >> 24 = 0
7. Normalization: unnecessary in this example, because t = 0
8. Subtract the implicit 1 bit: f = u - 0x00800000 = 0x00600000; value of 1.f = 1 + 1/2 + 1/4
9. Answer: r + rp = (1 + 1/2 + 1/4) × 2^1. Check: 3.25 + 0.25 = 3.5 = 1.75 × 2^1

FLOATING-POINT REPRESENTATIONS (15)

Find the smallest normalized positive floating-point number in the IEEE-754 single-precision representation:

  s = 0, e = 00000001, f = 00000000000000000000000 (hex 0x00800000)

. Value of exponent = 0x01 - B = 1 - 127 = -126
. Value of fraction: 1.f = 1 + 0/2^23 = 1
. Value of smallest normalized number = 2^-126 × 1 ≈ 1.175 × 10^-38

FLOATING-POINT REPRESENTATIONS (16)

Find the largest normalized positive floating-point number in the IEEE-754 single-precision representation:

  s = 0, e = 11111110, f = 11111111111111111111111 (hex 0x7f7fffff)

. Value of exponent = 0xfe - B = 254 - 127 = 127
. Value of fraction: 1.f = 1 + (2^23 - 1)/2^23 = 2 - 2^-23
. Value of number = 2^127 × (2 - 2^-23) ≈ 2^128 ≈ 3.403 × 10^38
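Both extreme values can be confirmed by decoding their bit patterns (Python with struct assumed; Python floats are doubles, so both single-precision values are exactly representable):

```python
import struct

def decode(word):
    """Value of a 32-bit IEEE-754 single-precision bit pattern."""
    return struct.unpack(">f", word.to_bytes(4, "big"))[0]

smallest = decode(0x00800000)   # e = 1, f = 0
largest = decode(0x7f7fffff)    # e = 254, f = all ones

assert smallest == 2.0 ** -126
assert largest == (2.0 - 2.0 ** -23) * 2.0 ** 127
```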

FLOATING-POINT REPRESENTATIONS (17)

Find the smallest and largest normalized positive floating-point numbers in the IEEE-754 double-precision representation:
. Smallest number = 1 × 2^-1022 ≈ 10^-307.65 ≈ 2.2251 × 10^-308
. Largest number = (2 - 2^-52) × 2^1023 ≈ 2^1024 ≈ 10^308.25 ≈ 1.7977 × 10^308

FLOATING-POINT REPRESENTATIONS (18)

How many floating-point numbers are there in a given representation?
. Let β = base, p = number of significant digits
. Number of values of the exponent e = e_max - e_min + 1
. Number of properly normalized values of the fraction = 2(β - 1)β^(p-1) (taking signs into account)
. Total number of normalized floating-point numbers in a representation:
  N(β, p, e_min, e_max) = 2(β - 1)β^(p-1)(e_max - e_min + 1) + { 1 if zero is unsigned, or 2 if zero is signed }

FLOATING-POINT REPRESENTATIONS (19)

Machine epsilon, ε_mach: the smallest positive floating-point number such that 1. + ε_mach > 1.
. Generally ε_mach = β^(1-p), where β is the base and p is the number of significant digits
. For IEEE-754:
  ε_mach = 2^-23 ≈ 1.19 × 10^-7 in single precision
  ε_mach = 2^-52 ≈ 2.22 × 10^-16 in double precision

FLOATING-POINT REPRESENTATIONS (20)

Machine epsilon in the IEEE-754 single-precision representation:
1. Compute r + rp = 1 + 2^-23
2. The difference of the exponents is 0 - (-23) = 23
3. Set bit 23 and clear bits 31-24 in both r and rp:
   r & 0x007fffff = 0x00000000, r | 0x00800000 = 0x00800000
   rp & 0x007fffff = 0x00000000, rp | 0x00800000 = 0x00800000
4. Shift rp right e - e' = 23 places to align its binary point with that of r: 0x00000001
5. Add r and rp (shifted) as unsigned integers: u = r + rp = 0x00800001
6. Compute t = u >> 24 = 0
7. Normalization: unnecessary in this example, because t = 0
8. Compute f = u - 0x00800000 = 0x00000001; value = 1 + 2^-23
9. A smaller rp results in u = r
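The same definition can be applied to Python's native float (IEEE-754 double precision) with a simple halving loop:

```python
# Machine epsilon found directly from the definition: the smallest eps
# (scanning downward over powers of the base 2) with 1.0 + eps > 1.0.
eps = 1.0
while 1.0 + eps / 2.0 > 1.0:
    eps /= 2.0

assert eps == 2.0 ** -52   # beta^(1-p) with beta = 2, p = 53
```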


FLOATING-POINT REPRESENTATIONS (21)

Consecutive floating-point numbers (IEEE-754 single precision):
. 1.00000000000000000000000 × 2^0:
  s = 0, e = 01111111, f = 00000000000000000000000 (hex 0x3f800000)
. 1.00000000000000000000001 × 2^0:
  s = 0, e = 01111111, f = 00000000000000000000001 (hex 0x3f800001)

FLOATING-POINT REPRESENTATIONS (22)

Consecutive floating-point numbers (IEEE-754 single precision):
. 1.11111111111111111111111 × 2^-1:
  s = 0, e = 01111110, f = 11111111111111111111111 (hex 0x3f7fffff)
. 1.00000000000000000000000 × 2^0:
  s = 0, e = 01111111, f = 00000000000000000000000 (hex 0x3f800000)

FLOATING-POINT REPRESENTATIONS (23)

Differences between consecutive floating-point numbers (IEEE-754 single precision):

  1.00000000000000000000001 × 2^0 - 1.00000000000000000000000 × 2^0 = 2^-23 = β^(1-p)
  1.00000000000000000000000 × 2^0 - 1.11111111111111111111111 × 2^-1 = 2^-24 = β^-p

There is a "wobble" of a factor of β (= 2 in binary floating-point representations) between the maximum and minimum relative change represented by 1 unit in the last place
. The "wobble" is 16 in hexadecimal representations
. Base 2 is the best for floating-point computation
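The two gap sizes can be observed directly for doubles with math.ulp and math.nextafter (available in Python 3.9 and later):

```python
import math

# Spacing of doubles just above and just below 1.0:
gap_above = math.ulp(1.0)                    # distance from 1.0 to the next double up
gap_below = 1.0 - math.nextafter(1.0, 0.0)   # distance from 1.0 to the next double down

assert gap_above == 2.0 ** -52
assert gap_below == 2.0 ** -53
assert gap_above / gap_below == 2.0          # the "wobble" equals the base
```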


FLOATING-POINT REPRESENTATIONS (24)

As for all sign-magnitude representations, there are two IEEE-754 single-precision representations for zero:
. +0.00000000000000000000000:
  s = 0, e = 00000000, f = 00000000000000000000000 (hex 0x00000000)
. -0.00000000000000000000000:
  s = 1, e = 00000000, f = 00000000000000000000000 (hex 0x80000000)

FLOATING-POINT REPRESENTATIONS (25)

Invalid results of floating-point operations after normalization:
. Overflow: e > e_max
  Single precision: e_max = 254 (including a bias of 127)
  Double precision: e_max = 2046 (including a bias of 1023)
. Underflow: e < e_min = 1 (the result is too small to represent in normalized form)

FLOATING-POINT REPRESENTATIONS (26)

Example of valid use of Inf: evaluate

  f(x) = 1 / (1 + 1/x)

starting at x = 0.0, in steps of 10^-5
. First evaluation (at x = 0.0): 1/x = Inf
  Value returned for f(x) is 1/(1 + Inf) = 1/Inf = 0.0
. The correct value is returned despite division by zero!
. Computation continues, giving the correct result for all values of x

FLOATING-POINT REPRESENTATIONS (27)

Example of invalid comparison of floating-point numbers:

  if (x .ne. y) then
    z = 1./(x-y)
  else
    z = 0.
    ierr = 1
  end if

. Consider
  x = 1.00100...0 × 2^-126
  y = 1.00010...0 × 2^-126
  x - y = 0.00010...0 × 2^-126 (underflow after normalization)
. If the underflowed result is "flushed" to zero, then the statement z = 1./(x-y) results in division by zero, even though x and y compare unequal!

FLOATING-POINT REPRESENTATIONS (28)

Denormalized floating-point numbers (IEEE-754 single precision):

  bit 31: s | bits 30-23: e = 00000000 | bits 22-0: f ≠ 0

Value assigned to a denormalized number:

  d = (-1)^s × 0.f × 2^-126

. No implicit 1 bit
. Purpose: gradual loss of significance for results that are too small to represent in normalized form
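Decoding the two boundary patterns shows the gradual-underflow range (Python with struct assumed):

```python
import struct

def decode(word):
    """Value of a 32-bit IEEE-754 single-precision bit pattern."""
    return struct.unpack(">f", word.to_bytes(4, "big"))[0]

# Smallest positive denormal: e = 0, f = 1 -> 0.f * 2^-126 = 2^-149
assert decode(0x00000001) == 2.0 ** -149
# Largest denormal: e = 0, f = all ones -> just below the smallest
# normalized number 2^-126
assert decode(0x007fffff) == (1.0 - 2.0 ** -23) * 2.0 ** -126
```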


FLOATING-POINT REPRESENTATIONS (29)

NaNs in IEEE-754 single precision:

  bit 31: s | bits 30-23: e = 11111111 | bits 22-0: f ≠ 0

The following operations generate the value NaN:
. Addition: Inf + (-Inf)
. Multiplication: 0 × Inf
. Division: 0/0 or Inf/Inf
. Computation of a remainder (REM): x REM 0 or Inf REM x
. Computation of a square root: sqrt(x) when x < 0
Purpose: allow computation to continue

FLOATING-POINT REPRESENTATIONS (30)

Rounding in IEEE-754 single precision:

  bit 31: s | bits 30-23: e | bits 22-0: f, extended by guard bits g1, g2 and sticky bit b_s

. Keep 2 additional ("guard") bits, plus a "sticky" bit
. Common rounding modes:
  Truncate (round toward 0)
  Round toward ±Inf
  Round to nearest, with tie-breaking when g1 = 1 and g2 = b_s = 0
    Tie-breaking method 1: round up (biases the result)
    Tie-breaking method 2: round so that bit 0 of the result is 0 ("round to even"; unbiased)

FLOATING-POINT REPRESENTATIONS (31)

Catastrophic cancellation in subtraction:
. Occurs when the relative error of the result is large compared to machine epsilon:
  |[round(x) - round(y)] - round(x - y)| / |round(x - y)| >> ε_mach
. Example (β = 10, p = 3, round to even): suppose we have computed results x = 1.005, y = 1.000
  Then round(x) = 1.00 and round(y) = 1.00, so round(x) - round(y) = 0, but round(x - y) = 5.00 × 10^-3
  The relative error of the result is 1 >> ε_mach
. This is the main reason for using double precision
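A binary analogue of the decimal example (Python; at 1e16 the spacing between adjacent doubles is 2, so the 1.0 added is rounded away and the subtraction then cancels to zero):

```python
# The added 1.0 is absorbed by rounding at magnitude 1e16, so the
# subtraction cancels catastrophically.
x = 1.0e16 + 1.0
y = 1.0e16
assert x - y == 0.0                 # true difference is 1.0: relative error 1
assert (1.0e16 - y) + 1.0 == 1.0    # reordering avoids the cancellation
```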


SEQUENTIAL LOGIC CIRCUITS (1)

The outputs z_1, ..., z_m of a sequential logic circuit depend on:
. The inputs x_1, ..., x_n
. Internal logical variables y_1, ..., y_r (the present state)

The next state y_1*, ..., y_r* depends on the inputs and the present state:

  y_j* = h_j(x_1, ..., x_n, y_1, ..., y_r)   [j ∈ (1 : r)]

Contrast with a combinational logic circuit, where the outputs depend only on the inputs:

  z_i = f_i(x_1, ..., x_n)   [i ∈ (1 : m)]

For a sequential logic circuit, the outputs depend on the inputs and the present state:

  z_i = g_i(x_1, ..., x_n, y_1, ..., y_r)   [i ∈ (1 : m)]

SEQUENTIAL LOGIC CIRCUITS (2)

The Huffman model of a sequential circuit:
. State element
  Memory that holds the present state
  Normally updated at intervals controlled by a clock signal
. Combinational logic that implements the Boolean functions z_i (outputs) and y_j* (next state)

[Block diagram: the inputs and the present state feed the combinational logic, which produces the outputs and the next state; the state element holds the state and is updated under clock control]

LATCHES

A latch is a sequential logic circuit that is controlled entirely by its excitation inputs, without a clock signal
. A latch is a bistable device
  Can remain indefinitely in either of two stable states
  For all but the simplest latches, the state can be changed by asserting one of the latch's inputs
  Used as a 1-bit memory element
. Simplest latches
  Set latch (can be set to 1, but not reset to 0)
  Reset latch (can be reset to 0, but not set to 1)
. Practical latches
  Set-reset latch
  Delay latch

SET LATCH

A set latch can be set to 1, but not reset to 0
. Initial state: input = 0, output = × (undefined)
. Input asserted ⇒ output = 1
. Output is now independent of the input

RESET LATCH

A reset latch can be reset to 0, but not set to 1
. Initial state: input = 0, output = × (undefined)
. Input asserted ⇒ output = 0
. Output is now independent of the input

SET-RESET (SR) LATCH (1)

• Initial state: Input = 0, outputs = × (undefined)

• R = 1, S = 0 ⇒ state in which Q = 0, Q' = 1:


SET-RESET (SR) LATCH (2)

• Next state: Inputs = 0, outputs unchanged (storage of previous state)

• R = 0, S = 1 ⇒ state in which Q = 1, Q' = 0:


SET-RESET (SR) LATCH (3)

• Next state: Inputs = 0, outputs unchanged (storage of previous state)

• R = 1, S = 1 ⇒ race condition (the gate with the shortest delay "wins"):


SET-RESET (SR) LATCH (4)

• Timing diagram:


S-R LATCH EXCITATION TABLE

S-R Latch Excitation Table

  Inputs  |  Old state  |  Next state  |  Comments
  S  R    |  Q   Q'     |  Q*  Q*'     |
  0  0    |  0   1      |  0   1       |  No change
  0  0    |  1   0      |  1   0       |  (storage state)
  0  1    |  0   1      |  0   1       |  Reset
  0  1    |  1   0      |  0   1       |
  1  0    |  0   1      |  1   0       |  Set
  1  0    |  1   0      |  1   0       |
  1  1    |  0   1      |  ×   ×       |  Race condition
  1  1    |  1   0      |  ×   ×       |  when S, R → 1

KARNAUGH MAP OF S-R LATCH EXCITATION TABLE

Karnaugh map showing the next state, Q*, as a function of the inputs R, S and the present state, Q:

[Karnaugh map over S, R and Q; the cells with S = R = 1 are don't-cares]

. The "don't cares" (d) can be covered by any minterm, because the outputs Q and Q' are undefined
. Choose the minterms that give the simplest logic circuit

EQUATION FOR NEXT STATE OF S-R LATCH

The state of an S-R latch is specified uniquely by Q
. Characteristic equation:

  Q* = S + R'Q

  Gives the next state (Q*) in terms of the present state (Q) and the excitation inputs (S, R)
  Can be derived from the excitation table
. When S = R = 1, both Q and Q' are set to logical 1 until at least one input changes
  The next state (Q*) is undefined
  The characteristic equation does not apply to this case
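The characteristic equation can be checked against the excitation table with a small truth-table sweep (Python sketch; the function name sr_next is illustrative):

```python
def sr_next(S, R, Q):
    """Characteristic equation of the S-R latch: Q* = S + R'Q.
    Not valid for S = R = 1 (the race condition)."""
    assert not (S and R), "S = R = 1: race condition, next state undefined"
    return S | ((1 - R) & Q)

# Check against the excitation table:
assert sr_next(0, 0, 0) == 0 and sr_next(0, 0, 1) == 1   # hold
assert sr_next(0, 1, 0) == 0 and sr_next(0, 1, 1) == 0   # reset
assert sr_next(1, 0, 0) == 1 and sr_next(1, 0, 1) == 1   # set
```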


S-R LATCH STATE TABLE

S-R Latch State Table

  Inputs  |  Old state  |  Next state  |  Outputs
  S  R    |  Q   Q'     |  Q*  Q*'     |  z   z'
  0  0    |  0   1      |  0   1       |  0   1
  0  0    |  1   0      |  1   0       |  1   0
  0  1    |  0   1      |  0   1       |  0   1
  0  1    |  1   0      |  0   1       |  0   1
  1  0    |  0   1      |  1   0       |  1   0
  1  0    |  1   0      |  1   0       |  1   0
  1  1    |  0   1      |  1   1       |  1   1
  1  1    |  1   0      |  1   1       |  1   1

For the S-R latch, the outputs are the same as the next-state variables
. This isn't true for the state table of every sequential circuit

FINITE STATE MACHINES (1)

A finite state machine is a conceptual tool used to describe the computational functioning of a sequential logic circuit without specifying the implementation

[Block diagram: the next-state function maps the current state and the inputs to the next state, under clock control; the output function maps the current state to the outputs]

FINITE STATE MACHINES (2)

A state diagram is a certain kind of directed graph
. The nodes (vertices) represent states of the machine
  A state is defined by the values of the internal logical variables
  Each node in a state diagram is labeled with the values that define the state that corresponds to the node
. The edges represent the state transitions
  Each edge in a state diagram is labeled with the inputs that cause the state transition that corresponds to the edge

FINITE STATE MACHINES (3)

The state diagram for the S-R latch:

[State diagram: two states, (Q = 0, Q' = 1) and (Q = 1, Q' = 0). S = 1 (with R = 0) moves the machine to the Q = 1 state; R = 1 (with S = 0) moves it to the Q = 0 state; otherwise each state holds.*]

* R = 1 when S = 1 gives a race condition

FINITE STATE MACHINES (4)

Finite state machines (FSMs) are used in industry to define the functions of a sequential logic circuit or a computer program without specifying the details of the implementation
An FSM that describes hardware:
. The control functions for the simple ALU in P&H, Chapter 5
FSMs that are usually implemented in software:
. The Transmission Control Protocol (TCP)
  You invoke this protocol whenever you click on a link in a World Wide Web page
. The Point-to-Point Protocol (PPP)
  You invoke this protocol whenever you start Dial-Up Networking in Windows

Finite State Machine for Simple ALU

[Finite-state-machine diagram for the multicycle datapath control (after P&H): state 0, instruction fetch (MemRead, ALUSrcA = 0, IorD = 0, IRWrite, ALUSrcB = 01, ALUOp = 00, PCWrite, PCSource = 00), leads to state 1, instruction decode/register fetch (ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00). From state 1, memory-reference instructions go to state 2 (memory address computation), then to state 3 (load memory access: MemRead, IorD = 1) and state 4 (write-back: RegDst = 0, RegWrite, MemtoReg = 1), or to state 5 (store memory access: MemWrite, IorD = 1); R-type instructions go to state 6 (execution) and state 7 (R-type completion: RegDst = 1, RegWrite, MemtoReg = 0); branches go to state 8 (branch completion: PCWriteCond, PCSource = 01) and jumps to state 9 (jump completion: PCWrite, PCSource = 10). All terminal states return to state 0.]

How many bits are needed to number each state uniquely?

[TCP state transition diagram: states CLOSED (starting point), LISTEN, SYN_SENT, SYN_RCVD, ESTABLISHED (data transfer), CLOSE_WAIT, LAST_ACK, FIN_WAIT_1, FIN_WAIT_2, CLOSING and TIME_WAIT (2MSL timeout), with the normal client and server transitions; edges are labeled with the application operation (open, close, send) or the segment received that triggers the transition, and with what is sent (SYN, ACK, FIN, RST) on each transition]

Reprinted from TCP/IP Illustrated, Volume 2: The Implementation by Gary R. Wright and W. Richard Stevens, Copyright © 1995 by Addison-Wesley Publishing Company, Inc.

PPP State Machine

[PPP link control state machine: states 0 Initial, 1 Starting, 2 Closed, 3 Stopped, 4 Closing, 5 Stopping, 6 Req-Sent, 7 Ack-Rcvd, 8 Ack-Sent and 9 Opened; transitions are driven by Up/Down and Open/Close events, timeouts (TO), and receipt of configure-requests, -acks and -naks (RCR, RCA, RCN) and terminate-requests and -acks (RTR, RTA), with the actions sent on each transition (SCR, SCA, SCN, STR, STA)]

After James Carlson, PPP: Design and Debugging

CSR LATCH (1)

• A gated SR latch (CSR latch) is controlled both by its excitation inputs (S, R) and an “enable” signal (C)  Notice how the NANDs isolate the S-R latch from the inputs when the clock signal is low, and pass the inputs through to the latch when the clock signal is high


CSR LATCH (2)

• A gated SR latch (CSR latch) is controlled both by its excitation inputs (S, R) and an “enable” signal (C)


CSR LATCH (3)

• Timing diagram:

[Timing diagram: inputs Clock, Reset, Set; outputs Q and Q']


CSR LATCH (4)

CSR Latch Excitation Table

  Inputs    |  Old state  |  Next state  |  Comments
  C  S  R   |  Q   Q'     |  Q*  Q*'     |
  0  ×  ×   |  0   1      |  0   1       |  Hold
  0  ×  ×   |  1   0      |  1   0       |
  1  0  0   |  0   1      |  0   1       |  No change
  1  0  0   |  1   0      |  1   0       |  (storage state)
  1  0  1   |  0   1      |  0   1       |  Reset
  1  0  1   |  1   0      |  0   1       |
  1  1  0   |  0   1      |  1   0       |  Set
  1  1  0   |  1   0      |  1   0       |
  1  1  1   |  0   1      |  ×   ×       |  Race condition
  1  1  1   |  1   0      |  ×   ×       |  when S, R → 1

CSR LATCH (5)

• The state of a CSR latch is specified uniquely by Q
 Characteristic equation: Q* = C'Q + CS + CR'Q
◦ Gives the next state (Q*) in terms of the present state (Q), the excitation inputs (S, R), and the clock signal (C)
◦ When the clock is high (C = 1), this is the same as the characteristic equation of the S-R latch (Q* = S + R'Q)
◦ When the clock is low (C = 0), the next state is the same as the present state (Q* = Q)
◦ Can be derived from the excitation table
 The characteristic equation does not apply when S = R = 1

CSR LATCH WITH “PRESET” AND “CLEAR” INPUTS


D LATCH (1)

• A delay latch (D latch) is a CSR latch in which the inputs are hardwired to be the logical complements of one another


D LATCH (2)

The state of a D latch is specified uniquely by Q
. Characteristic equation:

  Q* = C'Q + CD + CDQ = C'Q + CD

  Derived from the characteristic equation of the CSR latch (Q* = C'Q + CS + CR'Q) by setting S = D, R = D'
  Can also be derived from the excitation table
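A quick check of the D-latch characteristic equation (Python sketch; the function name d_next is illustrative):

```python
def d_next(C, D, Q):
    """Characteristic equation of the D latch: Q* = C'Q + CD."""
    return ((1 - C) & Q) | (C & D)

# Clock low: hold the present state; clock high: follow D.
assert d_next(0, 0, 1) == 1 and d_next(0, 1, 0) == 0   # C = 0: hold
assert d_next(1, 0, 1) == 0 and d_next(1, 1, 0) == 1   # C = 1: Q* = D
```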

[State diagram: states (Q = 0, Q' = 1) and (Q = 1, Q' = 0). C = 1, D = 1 moves the latch to Q = 1; C = 1, D = 0 moves it to Q = 0; C = 0 holds the present state for either value of D]


D LATCH (3)

Another realization of a D latch uses a direct implementation of the next-state function Q* = C'Q + CD
. No race condition exists, but there is a static-1 hazard, which can be removed by adding an AND gate that implements the minterm DQ

Karnaugh map for the next-state function of a D latch:

[Karnaugh map over C, D and Q for Q* = C'Q + CD; the added product term DQ overlaps the two groups and removes the static-1 hazard]

FLIP-FLOPS (1)

A flip-flop is a bistable sequential logic circuit that is controlled both by its excitation inputs and a clock signal
. The state can be changed by asserting one of the flip-flop's inputs either while the clock signal is asserted, or when the clock signal changes
. In useful designs, the output does not change until the leading edge of the next clock signal
Master-slave flip-flops
. Master-slave SR flip-flop
. Master-slave D flip-flop

MASTER-SLAVE S-R FLIP-FLOP (1)

For as long as the clock is low, the inputs R and S, and the present state Q1, control the next state of the first S-R latch
When the clock goes high, the outputs (Q and Q') of the first S-R latch, which are the inputs of the second latch, can change the second latch's state

Initial state (just turned on, clock high)

MASTER-SLAVE S-R FLIP-FLOP (2)

Just after clock goes low

Just after clock returns to high

MASTER-SLAVE D FLIP-FLOP (1)

[Circuit: master-slave D flip-flop; a master D latch feeds a slave D latch, with the clock inverted between the two so that the master is transparent while the clock is low and the slave while it is high]

Just after D is asserted while clock is low; present state is Q = 0

MASTER-SLAVE D FLIP-FLOP (2)

[Circuit: master-slave D flip-flop with the clock high; the master latch's state has propagated to the slave latch]

Just after clock returns to high; note transfer of state to second latch

MASTER-SLAVE D FLIP-FLOP (3)

[Timing diagram: input C (clock), input D, outputs Q and Q']

Timing diagram; note that state transitions occur on the leading (rising) edge of the clock signal

PULSE-TRIGGERED JK FLIP-FLOP (1)

[Circuit: pulse-triggered JK flip-flop with inputs J, K and Trigger and outputs Q, Q'; circuit symbol shown at right]

The JK flip-flop is similar to a master-slave SR flip-flop, except that when the J and K inputs are both high, the output "toggles" on the trailing (falling) edge of the clock signal. The circuit shown is equivalent to the circuit used in a commercial device, the SN7476.

PULSE-TRIGGERED JK FLIP-FLOP (2)

[Timing diagram: inputs J, K and Trigger; outputs Q and Q']

Timing diagram; note that state transitions occur on the trailing (falling) edge of the trigger signal

T FLIP-FLOP (1)

[Circuit: T flip-flop and its circuit symbol, with input T, a clock input, and outputs Q and Q']

T stands for "toggle"; note that this is simply a JK flip-flop with the J and K inputs held at Vdd

T FLIP-FLOP (2)

[Timing diagram: inputs Clock and JK (J and K tied together); outputs Q and Q']

Timing diagram; note that state transitions occur on the trailing (falling) edge of the clock signal

4-BIT COUNTER

Among the many useful devices that can be made with T flip-flops are counters
. How many clock cycles elapse between flashes of the LED (C)?

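One way to see the answer to the LED question (16 clock cycles) is to simulate a 4-bit ripple counter built from T flip-flops (Python sketch; the function name ripple_count is illustrative):

```python
def ripple_count(cycles, stages=4):
    """Simulate a ripple counter of T flip-flops: stage 0 toggles on every
    clock; stage k toggles when stage k-1 falls from 1 to 0."""
    q = [0] * stages
    for _ in range(cycles):
        q[0] ^= 1
        for k in range(1, stages):
            if q[k - 1] == 0:      # previous stage just fell -> toggle
                q[k] ^= 1
            else:
                break
    return q

assert ripple_count(8) == [0, 0, 0, 1]    # q3 first goes high after 8 clocks
assert ripple_count(16) == [0, 0, 0, 0]   # counter wraps: LED period = 16
```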

REGISTER FILE

[Figure: register file; two 5-bit read-register numbers select, via multiplexors, which of the 32-bit registers (Register 0 through Register n - 1) drive the outputs Read data 1 and Read data 2; a 5-bit write-register number and 32-bit Write data enter through the write port]

The register file is an array of arrays of flip-flops, addressed using a decoder, read using multiplexors. Data to be written to a particular register is broadcast to all registers, but only the specific register that is selected by an asserted “write enable” signal is modified.


4-BIT REGISTER


APPLICATION OF DECODER TO REGISTER ADDRESSING

[Figure: an n-to-2^n decoder takes the register number (n bits) and asserts one of 2^n select lines; the register data (32 bits) is broadcast to the D inputs of all registers 0 through 2^n - 1, but only the register whose decoder line and the write-enable signal are both asserted has its C (enable) input asserted and is modified]

• Each register has an "enable" input (labeled C in the figure)
 A register's enable input must be asserted in order for data to be written to the register through the "data" input (labeled D)
 The enable input is controlled by an AND gate
 Both the signal from the decoder and the "write enable" signal must be asserted in order for the register's enable input to be asserted


STEPS IN CREATING AN EXECUTABLE PROGRAM

• Creation of source “code”
 Accomplished with a text editor or a programming environment
• Compilation (if the source program is in a higher-level language)
 Includes one of various levels of optimization
 The result of compilation may be an assembly-language program
• Assembly produces an object module
 Human-readable instructions and macros → machine language
• Linking (a.k.a. link editing) produces a load module
 Resolution of calls to user and library functions
 Unresolved function calls produce run-time exceptions
• Loading produces a memory-resident executable module
 Memory references to code and data are updated to reflect actual locations
 The executable program is copied into memory

Figure: C program → Compiler → Assembly language program → Assembler → Object (machine-language module; library routines in machine language) → Linker → Executable (machine-language program) → Loader → Memory

A C PROGRAM AND ITS TRANSLATION INTO MIPS ASSEMBLER

• C program:

swap( int v[], int k )
{
    int temp;
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
}

 Purpose of this program: swap two array elements (note the use of a temporary variable, temp)
 Compilation into MIPS assembly language was performed with the lcc compiler (see following page)
 Using a temporary variable in assembler requires unnecessary memory traffic (an extra store and an extra load)
 All that is needed is to load both array elements and then swap them when storing (see the page after next)

MEMORY-TO-MEMORY COPY

Figure: memory-to-memory copy. The register file (registers 0–31) sits on the processor IC; main memory (dynamic RAM) is an array of bytes at addresses 00000000–FFFFFFFF. Copying requires (1) a LOAD from memory into a register, then (2) a STORE from that register back to memory.

SWAP THE VALUES OF TWO ARRAY ELEMENTS

Figure: swapping two array elements uses the same pattern — (1) LOAD each element from main memory into a register, then (2) STORE each register back to the other element's address.

.option O0          # initialization stuff for loader; ignore
.set noat
.set noreorder      # in the following, $tn means temporary register n
.text               # $t0 is general-purpose register $8
.globl swap__FPii   # $t7 is general-purpose register $15
.ent swap__FPii
swap__FPii:

REL.    OPCODE   INSTR. ARGUMENTS   COMMENTS
ADDRESS
------
000000  27bdfff0 addiu $sp,$sp,-16  # push arguments, etc. onto the stack
000004  afa40010 sw    $a0,16($sp)  # store 1st argument as addr v[0]
000008  afa50014 sw    $a1,20($sp)  # store 2nd argument as k
                                    # Source offset: 40
00000c  8fa80014 lw    $t0,20($sp)  # load k into register $t0
000010  00000000 nop
000014  00084880 sll   $t1,$t0,2    # put 4*k in register $t1
000018  8faa0010 lw    $t2,16($sp)  # load address of v[0] into $t2
00001c  00000000 nop
000020  01495821 addu  $t3,$t2,$t1  # addr v[k] = addr v[0] + 4*k
000024  8d6c0000 lw    $t4,0($t3)   # load v[k] into $t4
000028  00000000 nop
00002c  afac000c sw    $t4,12($sp)  # store v[k] into temp
                                    # Source offset: 57
000030  8fad0014 lw    $t5,20($sp)  # load k into register $t5
000034  00000000 nop
000038  21ae0001 addi  $t6,$t5,1    # put k+1 into register $t6
00003c  000e7880 sll   $t7,$t6,2    # multiply k+1 by 4; result in $t7
000040  8fb80010 lw    $t8,16($sp)  # load address of v[0] into $t8
000044  00000000 nop
000048  030fc821 addu  $t9,$t8,$t7  # addr v[k+1] = addr v[0] + 4*(k+1)
00004c  8f280000 lw    $t0,0($t9)   # load v[k+1] into $t0
000050  8fa90014 lw    $t1,20($sp)  # load k into register $t1
000054  00000000 nop
000058  00095080 sll   $t2,$t1,2    # multiply k by 4; put result in $t2
00005c  8fab0010 lw    $t3,16($sp)  # load address of v[0] into $t3
000060  00000000 nop
000064  016a6021 addu  $t4,$t3,$t2  # addr v[k] = addr v[0] + 4*k
000068  ad880000 sw    $t0,0($t4)   # store $t0 (i.e., v[k+1]) into v[k]
                                    # Source offset: 76
00006c  8fad000c lw    $t5,12($sp)  # load temp (i.e., v[k]) into $t5
000070  8fae0014 lw    $t6,20($sp)  # load k into $t6
000074  00000000 nop
000078  21cf0001 addi  $t7,$t6,1    # put k+1 into $t7
00007c  000fc080 sll   $t8,$t7,2    # put 4*(k+1) into $t8
000080  8fb90010 lw    $t9,16($sp)  # load address of v[0] into $t9
000084  00000000 nop
000088  03384021 addu  $t0,$t9,$t8  # addr v[k+1] = addr v[0] + 4*(k+1)
00008c  ad0d0000 sw    $t5,0($t0)   # store $t5 (i.e., old v[k]) to v[k+1]
                                    # Source offset: 91
000090  27bd0010 addiu $sp,$sp,16   # pop arguments, etc. off the stack
000094  03e00008 jr    $ra          # return to the calling routine
000098  00000000 nop
.end swap__FPii

# *****************************************
HAND-COMPILED VERSION OF swap.s
ASSUMPTIONS: address of v[0] is in register $a0
             value of k is in register $a1

INSTR. ARGUMENTS    COMMENTS
------
addiu  $sp,$sp,-16  # push arguments, etc. onto the stack
                    # (not needed if this is a leaf procedure)
sw     $a0,16($sp)  # store 1st argument as addr v[0]
sw     $a1,20($sp)  # store 2nd argument as k
sll    $t0,$a1,2    # put 4*k into register $t0
addu   $t1,$a0,$t0  # addr v[k] = addr v[0] + 4*k
lw     $t2,0($t1)   # load v[k] into $t2
addi   $t3,$a1,1    # put k+1 into register $t3
sll    $t4,$t3,2    # multiply k+1 by 4; result in $t4
addu   $t5,$t4,$a0  # addr v[k+1] = addr v[0] + 4*(k+1)
lw     $t6,0($t5)   # load v[k+1] into $t6
sw     $t2,0($t5)   # store $t2 (i.e., old v[k]) to v[k+1]
sw     $t6,0($t1)   # store $t6 (i.e., v[k+1]) into v[k]
addiu  $sp,$sp,16   # pop arguments, etc. off the stack
jr     $ra          # return to the calling routine
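The temp-free ordering used in the hand-compiled version — load both elements first, then store each into the other's slot — looks like this in C (a sketch; a compiler keeps a and b in registers, so no memory-resident temporary is needed):

```c
/* Swap v[k] and v[k+1]: two loads, then two stores,
   mirroring the hand-compiled MIPS sequence above. */
void swap(int v[], int k) {
    int a = v[k];       /* lw: load v[k] into a register   */
    int b = v[k + 1];   /* lw: load v[k+1] into a register */
    v[k] = b;           /* sw: store old v[k+1] into v[k]  */
    v[k + 1] = a;       /* sw: store old v[k] into v[k+1]  */
}
```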

UNIX OBJECT FILE FORMAT

• Object file header
 Describes the sizes and positions of the remaining parts
• Text segment
 Contains the machine-language code
• Data segment
 Includes both static and dynamic data
• Relocation information
 Identifies instructions and data that depend on absolute addresses
• Symbol table
 Contains undefined labels (e.g., external references)
• Debugging information
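As a sketch (the field names are hypothetical, not any particular UNIX format), the header can be modeled as a struct that records the size of each remaining part, so a tool can compute where each section starts:

```c
#include <stdint.h>

/* Hypothetical object-file header: records the size of each section
   so the positions of the remaining parts can be derived. */
struct obj_header {
    uint32_t text_size;    /* bytes of machine code           */
    uint32_t data_size;    /* bytes of static data            */
    uint32_t reloc_size;   /* bytes of relocation records     */
    uint32_t symtab_size;  /* bytes of symbol-table entries   */
    uint32_t debug_size;   /* bytes of debugging information  */
};

/* Sections follow the header in order, so each offset is a running sum. */
uint32_t text_offset(const struct obj_header *h)  { (void)h; return (uint32_t)sizeof(struct obj_header); }
uint32_t data_offset(const struct obj_header *h)  { return text_offset(h) + h->text_size; }
uint32_t reloc_offset(const struct obj_header *h) { return data_offset(h) + h->data_size; }
```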


FUNCTIONS OF THE LINK EDITOR (LINKER)

• Goal: to confine recompilation to those modules that have changed
 Each procedure can be compiled independently
• The linker:
 Places text and data modules symbolically in memory
 Determines the addresses that correspond to the labels that the programmer has given to instructions and data
 Patches the external and internal memory references
 Produces an executable file
• Executable file format
 Same as the format of an object file, except that there are no unresolved references, relocation information, symbol table, or debugging information
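The patching step can be sketched in C (the data structures here are hypothetical simplifications): the linker walks a list of unresolved call sites and fills each one in with the address the symbol table assigns to its label.

```c
#include <string.h>
#include <stdint.h>

/* Hypothetical relocation record: the instruction at `site` needs the
   final address of `label` patched into it. */
struct reloc  { const char *label; uint32_t *site; };
struct symbol { const char *label; uint32_t addr; };

/* Resolve every call site against the symbol table; return the number of
   references left unresolved (must be 0 to produce an executable). */
int link_patch(struct reloc *relocs, int nrel,
               const struct symbol *syms, int nsym) {
    int unresolved = 0;
    for (int i = 0; i < nrel; i++) {
        int found = 0;
        for (int j = 0; j < nsym; j++) {
            if (strcmp(relocs[i].label, syms[j].label) == 0) {
                *relocs[i].site = syms[j].addr;  /* patch the reference */
                found = 1;
                break;
            }
        }
        if (!found) unresolved++;
    }
    return unresolved;
}
```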


FUNCTIONS OF THE LINK EDITOR/LOADER

Figure: two object files (main, with unresolved jal calls to sub and printf, plus relocation records; and sub) together with the C library feed the linker, which resolves the calls (jal printf, jal sub) and produces a single executable file.

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

COMPONENTS OF AN INSTRUCTION SET ARCHITECTURE

• Data types supported in hardware
• Computational operations supported in hardware
• Memory access and addressing modes
• Branch implementation
• Support for procedure calls
• Instruction encoding
• Interrupt and exception handling

Figure: a computer comprises a processor (control + datapath), memory, and peripheral devices (input and output).


INSTRUCTION SET DESIGN

• A key step in both hardware and software design
 Software: ◦ The instruction set architecture (ISA), and its implementation, determine what kinds of programs perform well
 Hardware: ◦ Up to now, we have worked at the gate level ◦ In order to design or understand an ISA, we have to work at a higher level of organization, using modules or functional units: main memory; cache and cache controller; register file; integer unit (ALU); floating-point unit (FPU); branch unit; exception/interrupt processing unit; vector unit

INSTRUCTION SET DESIGN ISSUES

• What operations should be provided in hardware?
 An optimization problem
 Any computation can be done using only load/store/increment/branch ◦ Resulting code: large, with many memory references ⇒ slow
• How many / what kinds of operands?
 Must provide for one operand; should provide for two
 Examples: a ← op(a); c ← a op b
• What addressing modes should be implemented?
 Choices made here affect the complexity of both datapath and control
 Fixed-length vs. variable-length instructions
• How to encode operations in a few bits?
 Hardware design issue: how to translate operation encoding and addressing into control signals?
• How should interrupts and exceptions be handled?

KINDS OF INSTRUCTION SET ARCHITECTURES

• Accumulator
 Requires only one address
 Example: add x means acc ← acc + M[x]
• Stack
 Requires no explicit memory addresses
 Example: add means tos ← tos + next
• General-purpose register set
 Some operands may be in memory
 Example of adding two numbers in memory locations a and b: load r1,a; add r1,b; store r1,c
 Load/store architectures ◦ All arithmetic/logical operations use register operands ◦ Only load and store instructions reference memory
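The accumulator and stack styles above can be contrasted with a small C sketch (a toy model, not any real ISA): the same addition takes one explicit address per instruction on the accumulator machine and no addresses at all on the stack machine.

```c
/* Toy accumulator machine: one explicit memory address per instruction. */
int acc;
int M[16];
void acc_load(int x)  { acc = M[x]; }   /* load x  : acc <- M[x]       */
void acc_add(int x)   { acc += M[x]; }  /* add x   : acc <- acc + M[x] */
void acc_store(int x) { M[x] = acc; }   /* store x : M[x] <- acc       */

/* Toy stack machine: operands are implicit (top of stack and next). */
int stack[16], sp;
void push(int x) { stack[sp++] = M[x]; }
void s_add(void) { int t = stack[--sp]; stack[sp - 1] += t; }  /* tos <- tos + next */
void pop(int x)  { M[x] = stack[--sp]; }
```

For M[a] = 2 and M[b] = 3, the accumulator sequence load a; add b; store c and the stack sequence push a; push b; add; pop c both leave 5 in c.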

ADVANTAGES AND DISADVANTAGES OF VARIOUS ISAs

• Accumulator
 + Minimizes internal state of CPU
 − Maximizes memory-to-CPU traffic
• Stack
 + Reverse-Polish expression evaluation (like an HP calculator)
 − Stack can’t be randomly accessed ⇒ bottleneck
• General-purpose register set
 + Most general model for code generation
 − Each operand must be named ⇒ longer instructions
• Load/store
 + Fixed-length instruction encoding ◦ Simplifies hardware datapath and control ⇒ maximum speed
 − Increases total instruction count

WHY REGISTER ARCHITECTURES ARE DOMINANT

• Since 1980, logic circuits have become much faster than dynamic RAM
 Registers are faster than main memory
 Registers are faster than a stack (a stack is built in main memory)
• Operands that are in registers can be used in any order
 Registers can hold variables as they are updated in a multi-step computation ◦ Reduction of main-memory traffic makes computations faster
• Register addresses (typically 5 bits) are shorter than main-memory addresses (typically 32 to 64 bits)
 Code density (information per bit in an instruction) is higher in load-store architectures than in architectures that allow memory-resident operands
 Short, fixed-length addressing ⇒ fixed-length instructions
• For a register machine, one can optimize the most frequently used instructions (which are also simple, by experiment)

The top 10 instructions for the 80x86

Rank  80x86 instruction       Integer average (% total executed)
 1    load                    22%
 2    conditional branch      20%
 3    compare                 16%
 4    store                   12%
 5    add                      8%
 6    and                      6%
 7    sub                      5%
 8    move register-register   4%
 9    call                     1%
10    return                   1%
      Total                   96%

RISC INSTRUCTION SET ARCHITECTURES

• Common properties:
 Simple, pipelined load-store architecture ◦ Simple circuits ⇒ high clock frequency ◦ Pipelining divides the CPI by (roughly) the number of stages
 Large number of general-purpose registers (compared to CISC) ◦ Allocate active variables to registers, not memory
 (Mostly) orthogonal instruction set ◦ Only the most frequently executed operations
 Simple addressing modes ◦ Tightly integrated, hierarchical memory architecture
• Variations on the common theme:
 Register file design
 Branch implementation
 Floating-point unit design

THE MIPS INSTRUCTION SET ARCHITECTURE

• Designed by John Hennessy’s team at Stanford • Features of the MIPS R2000 architecture:  Very small instruction set  Minimal set of addressing modes  Single set of registers  Fast load and store ◦ 2 memory pipes (instruction and data) ◦ Most similar predecessor: CRAY-1  Short, deterministic delays on loads and branches


ASSEMBLERS (1)

• Internal components of a computer system:
 Integer processor (CPU) ◦ integer registers ◦ integer arithmetic ◦ logical operations
 Floating-point unit (FPU) ◦ floating-point registers ◦ floating-point arithmetic
 Random-access memory (RAM)
 I/O
• Assembly-language instructions control each of these components
 One line of assembly code ⇒ one hardware instruction (usually)

Figure: MIPS programmer's model — CPU (registers $0–$31, arithmetic unit, multiply/divide unit with Lo and Hi), FPU/Coprocessor 1 (registers $0–$31, arithmetic unit), and Coprocessor 0 for traps and memory (BadVAddr, Cause, Status, EPC), all connected to memory.

Figure: levels in a typical memory hierarchy — registers (≈200 B, 2 ns), L1 cache (64 KB, 4 ns), L2 cache (1 MB, 8 ns), main memory (128 MB, 60 ns), disk (12 GB, 5 ms).

ASSEMBLERS (2)

• How to design a programming language using only hardware instructions?
 Must take care of dataflow and control
 Data segment(s) of program ◦ Begins with a special directive (.data) ◦ Static (assembly-time) memory allocations for the inputs & outputs of the program ◦ Area for dynamic (run-time) memory allocation
 Control (“text”) segment(s) of program ◦ Uses hardware instructions to tell the processor what to do with the inputs & how to obtain the outputs ◦ Implicitly allocates memory for instructions


ASSEMBLERS (3)

General syntax of an assembly-language program: . Only one statement per line . Fixed format . Statements may have names (labels) to make programming easier . Kinds of statements (explained in detail on later slides): Comments Declarations Assignment statements Branch/jump instructions Communication with user or OS


ASSEMBLERS (4)

Parts of a MIPS assembly-language program: . Comments Everything on a line after a # symbol is a comment Comments may not span more than one line . Directives and declarations Directives (.data, .text, .stack) Tell the assembler about the program’s major memory regions (data, instructions, stack) Declarations of variables Tell the assembler how much memory is needed for variables Assign names to memory blocks Form: label: .keyword value:number of elements


SAL — SIMPLE ABSTRACT LANGUAGE

• Motivation for SAL:
 SAL purposely hides the details of MIPS assembly language
 SAL code looks a lot like C or Pascal code
 Introduces a useful level of abstraction between higher-level language and true assembly language ◦ Each C or Pascal statement translates to 1 or more SAL instructions ◦ Each SAL instruction translates to 1 or more MIPS assembly-language instructions
 The SPIM/SAL program (Windows and Macintosh) accepts both SAL and MIPS assembly-language instructions

SAL (Simple Abstract Language)

In the following, x must be a register or the label of an address in memory. However, y and z may be registers, labels, or constants. (y) means “contents of y”.

1. INSTRUCTIONS

Call Arg 1 Arg 2 Arg3 Operation

add x y z       (x) <- (y) + (z)
and x y z       (x) <- (y) AND (z)
b label         PC <- (label)
beq y z label   PC <- (label) if (y) = (z)
beqz y label    PC <- (label) if (y) = 0
bge y z label   PC <- (label) if (y) >= (z)
bgez y label    PC <- (label) if (y) >= 0
bgt y z label   PC <- (label) if (y) > (z)
bgtz y label    PC <- (label) if (y) > 0
ble y z label   PC <- (label) if (y) <= (z)
blez y label    PC <- (label) if (y) <= 0
blt y z label   PC <- (label) if (y) < (z)
bltz y label    PC <- (label) if (y) < 0
bne y z label   PC <- (label) if (y) != (z)
bnez y label    PC <- (label) if (y) != 0
cvt x y         (x) <- (y) with type conversion
div x y z       (x) <- (y) DIV (z)
done            stop running; return control to OS
get x           (x) <- value typed on keyboard
getc x          (x) <- ASCII code of value typed on keyboard
j label         PC <- (label)
la x label      (x) <- address of label
move x y        (x) <- (y)
mul x y z       (x) <- (y) * (z)
nop             no operation
nor x y z       (x) <- (y) NOR (z)
or x y z        (x) <- (y) OR (z)
put x           write (x) to the screen
putc y          write character with ASCII code = (y) on the screen
puts string     write (string) to the screen*
rem x y z       (x) <- (y) MOD (z)
rol x y sa      (x) <- (y) rotated left by sa bits
ror x y sa      (x) <- (y) rotated right by sa bits
sll x y sa      (x) <- (y) shifted left by sa bits
sra x y sa      (x) <- (y) shifted right by sa bits, sign-extended
srl x y sa      (x) <- (y) shifted right by sa bits, 0-extended
sub x y z       (x) <- (y) - (z)
xor x y z       (x) <- (y) XOR (z)

*string must be NULL terminated, like a variable declared with a .asciiz directive
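A few of the three-operand arithmetic entries above can be modeled in C (a toy interpreter sketch; the opcode names mirror the table, everything else — register array, function name — is hypothetical):

```c
/* Toy evaluator for a handful of SAL three-operand instructions.
   x, y, z are register indices into r[], and (x) <- (y) op (z). */
enum op { ADD, SUB, MUL, DIV_, REM, SLL };

void sal_exec(int r[], enum op o, int x, int y, int z) {
    switch (o) {
    case ADD:  r[x] = r[y] + r[z]; break;
    case SUB:  r[x] = r[y] - r[z]; break;
    case MUL:  r[x] = r[y] * r[z]; break;
    case DIV_: r[x] = r[y] / r[z]; break;  /* integer DIV */
    case REM:  r[x] = r[y] % r[z]; break;  /* MOD */
    case SLL:  r[x] = r[y] << z;   break;  /* here z plays the role of sa */
    }
}
```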

Reference: James Goodman and Karen Miller, "A Programmer's View of Computer Architecture" (Saunders College Publishing, 1993)

2. ASSEMBLER DIRECTIVES

.data    what follows are initial data
.text    what follows are instructions

label: .ascii string
    A string is placed in memory (one byte per character) starting at location label. string is an ASCII string enclosed in double quotes (" ").

label: .asciiz string
    A NULL-terminated string is placed in memory (one byte per character) starting at location label. string is an ASCII string enclosed in double quotes (" ").

label: .byte value[:num_elements]
    An ASCII character (initially containing value) is located at label. A character value is specified by enclosing the character in single quotes. value is an optional field; if it is omitted, the byte at label is initialized to 0. If num_elements is present, it specifies the number of bytes to be allocated and initialized to value.

label: .float value[:num_elements]
    An IEEE single-precision floating-point number (initially containing value) is located at label. value is an optional field; if it is omitted, the word at label is initialized to 0. If num_elements is present, it specifies the number of words to be allocated and initialized to value.

label: .space number_of_bytes
    A section of memory consisting of number_of_bytes bytes and starting at location label is set aside.

label: .word value[:num_elements]
    A 2's-complement integer (initially containing value) is located at label. value is an optional field; if it is omitted, the word at label is initialized to 0. If num_elements is present, it specifies the number of words to be allocated and initialized to value.

SPIM

• Motivation for SPIM:
 The MIPS simulator known as SPIM accepts human-readable MIPS assembly-language instructions ◦ SPIM code uses labels and pseudoinstructions to hide some of the details of the actual hardware instructions ◦ Introduces a small amount of abstraction between the programmer and the hardware
 Each SAL instruction translates to 1 or more MIPS assembly-language instructions
• There’s also True Assembly Language (TAL)
 TAL instructions correspond exactly to the hardware instructions produced by the assembler
 Each SPIM instruction translates to 1 or more TAL instructions

MIPS INSTRUCTIONS, REGISTERS AND SYSTEM CALLS

1. INTEGER and GENERAL INSTRUCTIONS

Call Arg 1 Arg 2 Arg3 Description

* abs rd rs        put the absolute value of rs into rd
add rd rs rt       rd = rs + rt (with overflow)
addu rd rs rt      rd = rs + rt (without overflow)
addi rt rs imm     rt = rs + imm (with overflow)
addiu rt rs imm    rt = rs + imm (without overflow)
and rd rs rt       put rs AND rt into rd
* b label          branch to label
beq rs rt label    branch to label if (rs==rt)
* beqz rs label    branch to label if (rs==0)
* bge rs rt label  branch to label if (rs>=rt)
bgez rs label      branch to label if (rs>=0)
bgt rs rt label    branch to label if (rs>rt)
bgtz rs label      branch to label if (rs>0)
ble rs rt label    branch to label if (rs<=rt)
blez rs label      branch to label if (rs<=0)
* blt rs rt label  branch to label if (rs<rt)
* sge rd rs rt     if (rs>=rt) rd=1; else rd=0
* sgt rd rs rt     if (rs>rt) rd=1; else rd=0
* sle rd rs rt     if (rs<=rt) rd=1; else rd=0
sll rd rt sa       rd = rt shifted left by distance sa
slt rd rs rt       if (rs<rt) rd=1; else rd=0

* Indicates a pseudoinstruction (the assembler generates more than one hardware instruction to produce the same result; a pseudoinstruction has no opcode of its own)

2. FLOATING POINT INSTRUCTIONS (single precision)

Call Arg 1 Arg 2 Arg3 Description

abs.s fd fs     fd = absolute value of fs
add.s fd fs ft  fd = fs + ft
bc1f label      branch to label if float-flag is FALSE
bc1t label      branch to label if float-flag is TRUE
c.eq.s fs ft    if (fs==ft) then float-flag is TRUE, else it's FALSE
c.le.s fs ft    if (fs<=ft) then float-flag is TRUE, else it's FALSE
c.lt.s fs ft    if (fs<ft) then float-flag is TRUE, else it's FALSE

3a. CPU REGISTERS

Name Register Function

$a0   4   used to pass the first of four arguments
$a1   5   used to pass the second of four arguments
$a2   6   used to pass the third of four arguments
$a3   7   used to pass the fourth of four arguments
$at   1   reserved for the operating system
$fp  30   frame pointer
$gp  28   global memory pointer
$k0  26   reserved for the operating system
$k1  27   reserved for the operating system
$ra  31   return address
$s0  16   callee-saved register
$s1  17   callee-saved register
$s2  18   callee-saved register
$s3  19   callee-saved register
$s4  20   callee-saved register
$s5  21   callee-saved register
$s6  22   callee-saved register
$s7  23   callee-saved register
$sp  29   stack pointer to the first free location
$t0   8   used to hold temporary variables
$t1   9   used to hold temporary variables
$t2  10   used to hold temporary variables
$t3  11   used to hold temporary variables
$t4  12   used to hold temporary variables
$t5  13   used to hold temporary variables
$t6  14   used to hold temporary variables
$t7  15   used to hold temporary variables
$t8  24   used to hold temporary variables
$t9  25   used to hold temporary variables
$v0   2   used to pass values to and from functions
$v1   3   used to pass values to and from functions

3b. FPU (CP1) REGISTERS

Name Register Function

$f0   (float)  holds floating-point type function results
$f2   (float)  holds floating-point type function results
$f4   (float)  temporary register
$f6   (float)  temporary register
$f8   (float)  temporary register
$f10  (float)  temporary register
$f12  (float)  passes the first of two float arguments
$f14  (float)  passes the second of two float arguments
$f16  (float)  temporary register
$f18  (float)  temporary register
$f20  (float)  saved register
$f22  (float)  saved register
$f24  (float)  saved register
$f26  (float)  saved register
$f28  (float)  saved register
$f30  (float)  saved register

3c. CP0 REGISTERS

Register Function

 8  BadVAddr (Bad Virtual Address)
12  Status
13  Cause
14  EPC (Exception Program Counter)

4. SYSTEM CALL INFORMATION

Call Code Event Arguments Result (in $v0)

 1  print int      $a0 = integer               $a0 is printed out
 2  print float    $f12 = float                $f12 is printed out
 4  print string   $a0 = pointer to string     string is printed out
 5  read int                                   $v0 holds integer read
 6  read float                                 $f0 holds float read
 7  read double                                $f0 holds double read
 8  read string    $a0 = buffer, $a1 = length  string is read from console
 9  sbrk           $a0 = amount                address of additional memory in $v0
10  exit program
11  print byte     $a0 = byte                  byte is printed out
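The dispatch convention above — service code in $v0, argument in $a0 or $f12 — can be sketched as a C switch (a toy model of a simulator's syscall handler; the function and buffer are hypothetical, and output goes into a string so the behavior is easy to check):

```c
#include <stdio.h>

/* Toy model of SPIM's syscall dispatch: v0 plays the role of register
   $v0 (the service code) and a0 plays the role of $a0 (the argument). */
void do_syscall(int v0, int a0, char *out) {
    switch (v0) {
    case 1:  sprintf(out, "%d", a0);       break;  /* code 1: print int   */
    case 11: sprintf(out, "%c", (char)a0); break;  /* code 11: print byte */
    case 10: out[0] = '\0';                break;  /* code 10: exit       */
    default: out[0] = '\0';                break;  /* other services omitted */
    }
}
```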

------

Originally due to Eric Gauthier. Modified and corrected by C. D. Cantrell.

MIPS DATA TYPES

• Integer (several different sizes)
• Floating point (2 sizes)
• Character (really a 1-byte unsigned integer)
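The same three families exist in C. As a sketch, the typedefs below pair C types with the MIPS storage directives they correspond to (the sizes shown match common 32-bit ABIs; other platforms can differ, and the type names here are illustrative only):

```c
/* MIPS-style data types as they appear in C. */
typedef signed char byte_t;    /* .byte  : 1-byte integer / character */
typedef short       half_t;    /* .half  : 2-byte integer             */
typedef int         word_t;    /* .word  : 4-byte integer             */
typedef float       single_t;  /* .float : IEEE single precision      */
typedef double      dword_t;   /* .double: IEEE double precision      */

/* A character is really a 1-byte unsigned integer, so arithmetic works. */
int char_as_int(char c) { return (int)(unsigned char)c; }
```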


VARIABLE DECLARATIONS (1)

• Static variables and arrays belong in a section that begins with .data • Character and numerical variables:

C          SPIM/SAL
char c;    c: .byte value:num_elements
int n;     n: .word value:num_elements
float x;   x: .float value:num_elements

• Initial values are optional
• value must be appropriate for the type of data being stored:
 ◦ For byte, value is an ASCII character enclosed in single quotes ('')
 ◦ For word, value must be an integer
 ◦ Example for float: value is 3.2767e4 (= 32,767)

# SPIM program declare.s
# Data segment
        .data
ch:      .byte 'C'        # C equivalent: char ch; ch = 'C';
chint:   .word            # Reserves space for ch considered as an integer
posint:  .word 32767      # C equivalent: int posint; posint = 32767;
negint:  .word -32767     # C equivalent: int negint; negint = -32767;
f:       .float 3.2767e4  # C equivalent: float f; f = 3.2767e4;
fint:    .word            # Reserves space for f considered as an integer
newline: .byte '\n'

# Text segment (instructions)
        .text
__start:
# Print ch as a character; equivalent to the C command printf("%c\n",ch);
        lb   $a0, ch       # Load the byte labeled ch into register $a0
        li   $v0, 11       # Put the integer 11 into register $v0
        syscall            # Prints the byte as a character on the console
                           # (see the system call info in spim.inst.txt!)
        lb   $a0, newline  # Load the byte labeled newline into $a0
        li   $v0, 11
        syscall            # Prints a newline on the console

# Copy the contents of ch into the location labeled chint
        lb   $a0, ch       # The byte is right-justified in the register
        sw   $a0, chint    # Store the word (necessary because we plan to
                           # re-use chint; see below)

# Print chint as an integer (not a character): printf("%d\n",ch);
        li   $v0, 1
        syscall            # Prints the contents of $a0 as an integer
        lb   $a0, newline
        li   $v0, 11
        syscall

# Print chint as a floating-point number: printf("%e\n",ch);
        lwc1 $f12, chint   # $f12 is in coprocessor 1, so we can't just use lw
        li   $v0, 2
        syscall            # Prints the floating-point number on the console
        lb   $a0, newline
        li   $v0, 11
        syscall

# Print posint as an integer: printf("%d\n",posint);
        lw   $a0, posint
        li   $v0, 1
        syscall
        lb   $a0, newline
        li   $v0, 11
        syscall

# Print posint as a floating-point number: printf("%e\n",posint);
        lwc1 $f12, posint
        li   $v0, 2
        syscall
        lb   $a0, newline
        li   $v0, 11
        syscall

# Print negint as an integer: printf("%d\n",negint);
        lw   $a0, negint
        li   $v0, 1
        syscall
        lb   $a0, newline
        li   $v0, 11
        syscall

# Print negint as a floating-point number: printf("%e\n",negint);
        lwc1 $f12, negint
        li   $v0, 2
        syscall
        lb   $a0, newline
        li   $v0, 11
        syscall

# Print a pointer to posint; equivalent to the C commands
# register p; p = &posint; printf("%p\n",p);
        la   $a0, posint   # Load the address of posint into $a0
        li   $v0, 1
        syscall
        lb   $a0, newline
        li   $v0, 11
        syscall

# Print the floating-point number stored at f: printf("%6.6f\n",f);
        lwc1 $f12, f
        li   $v0, 2
        syscall
        lb   $a0, newline
        li   $v0, 11
        syscall

# Print the value stored at f, interpreted as an integer: printf("%d\n",f);
        lwc1 $f0, f
        mfc1 $a0, $f0      # Bit-for-bit copy from $f0 to CPU register $a0
        sw   $a0, fint     # Store the bitwise copy of f into location fint
        li   $v0, 1
        syscall            # Prints the copy of f as an integer
        lb   $a0, newline
        li   $v0, 11
        syscall
        li   $v0, 10
        syscall            # Return control to the OS

# io.s
# A simple SPIM program to illustrate input-output operations using the console
# Please load this program into SPIM and single-step through it,
# while watching the register, text and data windows, and the console

# Data segment
        .data
intdec:  .word 12345
inthex:  .word 0x12345
temp:    .word
newline: .byte '\n'

# Text (instruction) segment
        .text
__start:
        lw   $a0, intdec   # Load the value stored at intdec (12345) into $a0
        li   $v0, 1        # (no load from memory needed for li)
        syscall            # Prints the contents of $a0 as an integer
        lb   $a0, newline
        li   $v0, 11
        syscall            # Prints a newline on the console

        lw   $a0, inthex   # Load the value stored at inthex (0x12345) into $a0
        li   $v0, 1
        syscall
        lb   $a0, newline
        li   $v0, 11
        syscall

        li   $v0, 11
        syscall            # Prints a newline on the console
        ori  $v0, $0, 5    # Put the integer 5 into register $v0
        syscall            # Reads an integer from the console into $v0
        sw   $v0, temp     # Store that integer into the location temp
        lw   $a0, temp     # Load the value entered from the console into $a0
        li   $v0, 1
        syscall
        lb   $a0, newline
        li   $v0, 11
        syscall

        la   $a0, intdec   # Load the address of intdec into $a0,
                           # which now contains a pointer to intdec
        li   $v0, 1
        syscall            # Prints the contents of $a0 as an integer

        li   $v0, 10
        syscall            # Return control to the OS
                           # (read the system call info at the end of
                           # spim.inst.txt if you don't understand this step!)
# A good exam question: what is the final value stored in the
# location labeled temp?

ALIGNMENT

• In MIPS assembler (and in most RISC assemblers), the address of the
  low-address byte of a block of memory that holds a data type must be a
  multiple of the size of the data type
   The address of a byte can be any unsigned integer within the address space
   The address of a word (4 bytes) must be a multiple of 4
    ◦ A word address ends with 2 zero bits (00)
    ◦ Possible last hexadecimal digits in a word address: 0, 4, 8, C
   The address of a doubleword (8 bytes) must be a multiple of 8
    ◦ A doubleword address ends with 3 zero bits (000)
    ◦ Possible last hexadecimal digits in a doubleword address: 0, 8
   The directive .align n aligns the next datum on an address that is a
    multiple of 2^n

© C. D. Cantrell (01/1999)
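The alignment rule above reduces to a single modulus check. A minimal sketch (Python used purely for illustration; the addresses are made-up examples):

```python
def aligned(address, size):
    """True if address is a multiple of size (the MIPS alignment rule)."""
    return address % size == 0

# A word (4 bytes) address must end in hex digit 0, 4, 8 or C
for addr in (0x10010000, 0x10010004, 0x10010008, 0x1001000C):
    assert aligned(addr, 4)
assert not aligned(0x10010002, 4)   # halfword-aligned only

# A doubleword (8 bytes) address must end in hex digit 0 or 8
assert aligned(0x10010008, 8)
assert not aligned(0x10010004, 8)
```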

VARIABLE DECLARATIONS (2)

• Strings in SPIM and SAL:
   label: .asciiz string — Puts a null-terminated string into memory
    starting at location label. string is an ASCII string enclosed in
    double quotes ("")
   label: .ascii string — Puts a string into memory starting at location
    label. string is an ASCII string enclosed in double quotes ("")
• Why put a null byte (0x00) at the end of a string?
   C convention: A string is scanned up to a null byte (\0)
   String length is determined only when the data is read

© C. D. Cantrell (10/1999)
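The C scanning convention can be sketched in a few lines (Python here, purely illustrative; the string "Hello" is an arbitrary example):

```python
# Model of what  label: .asciiz "Hello"  stores: the bytes of the string
# followed by a terminating null byte (0x00)
memory = list(b"Hello\x00")

# C-style scan: the length is discovered only when the data is read
length = 0
while memory[length] != 0:
    length += 1

assert length == 5          # strlen("Hello")
assert memory[length] == 0  # the .asciiz terminator
```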

ASSEMBLERS (5)

• Parts of a MIPS assembly-language program (continued):
   Assignment statements
    ◦ In a higher-level language, an assignment statement changes the
      value of a variable
    ◦ When translated into assembly language, assignment statements become
      arithmetic, logical, or data-transfer instructions
    ◦ Form of an assembly-language arithmetic instruction:
      operation destination, source1, source2
   Loops
    ◦ Must provide equivalents for C loops that use while and for

© C. D. Cantrell (10/1999)

ASSIGNMENT STATEMENTS (1)

• An assignment statement changes the value of a variable
   Belongs in a segment that begins with .text

      C             SAL         action
      z = y;        move z,y    z ← (y)
      z = y + x;    add z,y,x   z ← (y) + (x)
      z = y - x;    sub z,y,x   z ← (y) − (x)
      z = y * x;    mul z,y,x   z ← (y) · (x)
      z = y / x;    div z,y,x   z ← (y)/(x)
      z = y % x;    rem z,y,x   z ← (y) (mod (x))
      z = (type)y;  cvt z,y     z ← (y) (with type conversion)

   SAL allows memory-resident operands; SPIM does not
   (y) means the contents of the memory location with label y
   z ← (y) means, "Store the contents of y into location z"

© C. D. Cantrell (04/1999)

ASSIGNMENT STATEMENTS (2)

• All assignment statements in SPIM must use operands that are in registers

      C           SPIM               action
      z = y;      lw $rs, y          z ← (y)
                  sw $rs, z
      z = y + x;  lw $rs, x          z ← (y) + (x)
                  lw $rt, y
                  add $rd, $rs, $rt
                  sw $rd, z

   (y) means the contents of the memory location with label y
   x, y and z must be labels defined in a .data directive
   The same instruction sequence shown for add applies to other arithmetic
    instructions such as sub, mul, div, and rem

© C. D. Cantrell (04/1999)

REGISTERS (1)

• A register is an array of flip-flops with special properties:
   Registers are the fastest kind of addressable memory
   Registers are located on the same die as the processor
    ◦ Registers are NOT located in main memory
   The set of registers in a processor is called the register file
    ◦ Registers are addressed separately & differently from main memory
   MIPS R2000 architecture:
    ◦ Main memory addresses are 32 bits
      ⇒ 4,294,967,296 addressable 8-bit bytes
    ◦ Register addresses are 5 bits
      ⇒ 32 addressable 32-bit general-purpose registers
    ◦ Register addresses run from 00000 (binary) = 0 to 11111 (binary) = 31
   MIPS assembler: Register addresses are written $0, ..., $31

© C. D. Cantrell (01/1999)
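The address-space arithmetic on this slide is easy to cross-check (Python, illustrative only):

```python
# 32-bit main-memory addresses vs. 5-bit register addresses
assert 2**32 == 4_294_967_296   # addressable 8-bit bytes of main memory
assert 2**5 == 32               # addressable general-purpose registers
assert int('11111', 2) == 31    # register addresses run from $0 to $31
```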

REGISTERS (2)

• Why are registers necessary?
   The processor needs a "scratch pad" for operands and intermediate results
   Some data items must be accessed in 1 clock period to avoid wait states
• Why isn't the register file a part of main memory?
   Really fast memory is really expensive
    ⇒ number of words in registers ≪ number of words in main memory
   Really fast memory is really hot
    ⇒ major heat-dissipation problem, even if you're rich!
   Propagation velocity of a signal ≈ 6 inches per nanosecond
    ⇒ fast memory must be physically close to the ALU

© C. D. Cantrell (01/1999)

REGISTER USAGE AND ADDRESSING MODES (slides by John Hennessy, 8/14/91)

• Immediate addressing: the instruction contains the operand
  (example: load #3, ...)
• Direct addressing: the instruction contains the address of the operand
  (example: load A, ...)
• Indirect addressing: the instruction contains the address of the address
  of the operand (example: load (A), ...)
• Register direct addressing: a register contains the operand
  (example: load R1, ...)
• Register indirect addressing: a register contains the address of the
  operand (example: load [R2], ...)
• Displacement (based or indexed) addressing: address of operand =
  register + constant (example: load 4[R2], ...)
• Relative addressing: address of operand = PC + constant
  (example: loadrel 4[PC], ...)

[Figure: percent of total displacement addresses vs. number of bits (1-16)
in the displacement; integer and floating-point averages]

[Figure: percent of branch instructions vs. number of bits (1-16) of branch
displacement; integer and floating-point averages]

• Addressing mode usage (measured frequency):

      Register                41%
      Displacement            21%
      Literal                 18%
      Register Deferred        9%
      Index                    6%
      Autoincrement            4%
      Autodecrement            1%
      Displacement Deferred    1%

• MIPS addressing modes:
  1. Immediate addressing: the operand is a constant field within the
     instruction (op | rs | rt | immediate)
  2. Register addressing: the operand is in a register
     (op | rs | rt | rd | ... | funct)
  3. Base addressing: the operand is in memory at the address given by a
     register plus the instruction's address field (byte, halfword or word)
  4. PC-relative addressing: the memory address is the PC plus the
     instruction's address field (word)
  5. Pseudodirect addressing: the memory address is the instruction's
     address field combined with the upper bits of the PC (word)

LOAD-STORE ARCHITECTURES

• All instructions other than load and store operate only on:
   Data in registers
   Immediate data (data encoded in the assembled instruction)
• Advantages of RISC load-store architectures:
   Speed
    ◦ ALU operations execute in 1 clock period
   Small number of addressing modes
    ◦ MIPS processors (RISC): 1 form of add instruction
    ◦ VAX processors (CISC): 20,000 forms of add instruction
   Simple control unit
    ◦ Short clock period
    ◦ Easily & quickly scalable IC design

© C. D. Cantrell (01/1999)

MIPS ADDRESSING MODES (1)

• Register direct addressing: R-format instructions
   add Rd, Rs, Rt: Addition (with overflow)

      | 0 | Rs | Rt | Rd | 0 | 0x20 |
        6    5    5    5   5     6     (field widths in bits)

   addu Rd, Rs, Rt: Addition (without overflow)

      | 0 | Rs | Rt | Rd | 0 | 0x21 |
        6    5    5    5   5     6     (field widths in bits)

   Rs, Rt, Rd are addresses within the register file

© C. D. Cantrell (01/1999)

R-FORMAT INSTRUCTIONS

• Fields of an R-format instruction:

      | Op | Rs | Rt | Rd | Shamt | Funct |
         6    5    5    5     5       6     (field widths in bits)

   Op: 6-bit opcode; 0 for R format
   Rs, Rt, Rd: 5-bit register addresses (example: 00011 (binary) = $3)
   Shamt: 5-bit shift amount for sll, srl and sra (example: 10111 (binary)
    = 23); 0 for all other instructions
   Funct: 6-bit function code (example: 101010 (binary) = slt)

© C. D. Cantrell (10/1999)
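The field packing can be reproduced by shifting each field into place and OR-ing. A sketch (Python; the instruction add $3, $1, $2 is an arbitrary example chosen for illustration):

```python
def encode_r(op, rs, rt, rd, shamt, funct):
    """Pack the six R-format fields (6, 5, 5, 5, 5, 6 bits) into a 32-bit word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# add $3, $1, $2: Op = 0 (R format), Shamt = 0, Funct = 0x20 (add)
word = encode_r(0, 1, 2, 3, 0, 0x20)
assert word == 0x00221820

# slt uses function code 101010 (binary) = 0x2a, in the low 6 bits
assert encode_r(0, 1, 2, 3, 0, 0b101010) & 0x3F == 0x2A
```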

MIPS ADDRESSING MODES (2)

• Load and store instructions move data between main memory and the
  register file
   lw $Rt, address copies a 32-bit word from address in main memory to
    register $Rt
   sw $Rs, address copies a 32-bit word from register $Rs to address in
    main memory

      Syntax for address  Meaning                   Used when:
      label               Address of label          label is the name of a
                                                    location in memory
      label + Imm         Address of label + Imm    Value of Imm is known
      ($Rn)               Contents of $Rn           Full address is in $Rn
      Imm($Rn)            Imm + contents of $Rn     Value of Imm is known
      label + Imm($Rn)    Address of label +        $Rn contains offset
                          [Imm + contents of $Rn]   from address of label

© C. D. Cantrell (01/1999)

MIPS ADDRESSING MODES (3)

• Displacement (indexed) addressing: I-format instructions

      | Op | Rs | Rt | Offset |
         6    5    5     16     (field widths in bits)

   Rs and Rt are 5-bit addresses within the register file
   Offset is a signed 16-bit integer (two's complement representation)
    ◦ Range of Offset is 0x8000 = −32,768 to 0x7fff = 32,767
    ◦ If Offset were interpreted as an unsigned integer address in main
      memory, there would be only 64K = 65,536 addressable locations
      (as in the Intel 8080 and Motorola 6502)

© C. D. Cantrell (01/1999)

MIPS ADDRESSING MODES (4)

• How can we use 32-bit or 64-bit addresses when the instruction length is
  only 32 bits?

      | Op | Rs | Rt | Offset |
         6    5    5     16     (field widths in bits)

• Base-offset (indexed) addressing in the MIPS R2000 processor:
      Address in memory = [32-bit address in register Rs] + Offset
   Register Rs holds a 32-bit base address
   Offset is a signed 16-bit integer (two's complement representation)
   The effective address in main memory can be anywhere in a 64K-byte
    segment centered on the base address

© C. D. Cantrell (01/1999)

I-FORMAT INSTRUCTIONS

• Fields of an I-format instruction:

      | Op | Rs | Rt | Offset |
         6    5    5     16     (field widths in bits)

   Op: 6-bit opcode (example: 100011 (binary) = lw)
   Rs, Rt: 5-bit register addresses (example: 00100 (binary) = $4)
   Offset: 16-bit two's complement offset
    (example: 1111 1111 1111 0100 (binary) = −12)

• Example: lw $4, 12($1)
   $1 holds the 32-bit base address of a memory location
   The offset 12 is coded in the instruction (must fit in 16 bits)
   $4 will hold the contents of the memory location at address 12 + ($1)

      100011 00001 00100 0000 0000 0000 1100
        lw     $1    $4      (offset = 12)

   Assembled instruction: 0x8c24000c

© C. D. Cantrell (05/1997)
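The assembled value 0x8c24000c can be reproduced by packing the I-format fields (Python sketch, illustrative only):

```python
def encode_i(op, rs, rt, offset):
    """Pack the I-format fields (6, 5, 5, 16 bits) into a 32-bit word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (offset & 0xFFFF)

# lw $4, 12($1): opcode 100011 (binary) = 0x23, Rs = $1 (base), Rt = $4
assert encode_i(0b100011, 1, 4, 12) == 0x8C24000C

# A negative offset is stored in two's complement: -12 -> 0xfff4
assert encode_i(0b100011, 1, 4, -12) & 0xFFFF == 0xFFF4
```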

MIPS REGISTERS (1)

      Register  Name or Value    Purpose
      PC        Program Counter  Address of next instruction
      $0        Value always 0   Comparisons; address construction
      $1        $at              Assembler temporary register
      $2        $v0              Value returned by a function
      $3        $v1              Value returned by a function
      $4        $a0              First argument
      $5        $a1              Second argument
      $6        $a2              Third argument
      $7        $a3              Fourth argument

© C. D. Cantrell (01/1999)

MIPS TEMPORARY REGISTERS

• Hold temporary variables in a procedure that calls other procedures
   Programming convention: Not preserved across a procedure call

      Register  Name  Purpose
      $8        $t0   Holds a local variable
      $9        $t1   Holds a local variable
      $10       $t2   Holds a local variable
      $11       $t3   Holds a local variable
      $12       $t4   Holds a local variable
      $13       $t5   Holds a local variable
      $14       $t6   Holds a local variable
      $15       $t7   Holds a local variable
      $24       $t8   Holds a local variable
      $25       $t9   Holds a local variable

© C. D. Cantrell (01/1999)

MIPS SAVED TEMPORARY REGISTERS

• Hold temporary variables used by more than one procedure
   Programming convention: Preserved across a procedure call

      Register  Name  Purpose
      $16       $s0   Holds a saved temporary variable
      $17       $s1   Holds a saved temporary variable
      $18       $s2   Holds a saved temporary variable
      $19       $s3   Holds a saved temporary variable
      $20       $s4   Holds a saved temporary variable
      $21       $s5   Holds a saved temporary variable
      $22       $s6   Holds a saved temporary variable
      $23       $s7   Holds a saved temporary variable

© C. D. Cantrell (01/1999)

OTHER MIPS GENERAL-PURPOSE REGISTERS

      Register  Name  Purpose
      $26       $k0   Reserved for use by OS kernel
      $27       $k1   Reserved for use by OS kernel
      $28       $gp   Pointer to global area
      $29       $sp   Pointer to first free location on stack
      $30       $fp   Frame pointer
      $31       $ra   Return address

© C. D. Cantrell (01/1999)

MIPS SPECIAL-PURPOSE REGISTERS (1)

• Program counter (PC)
   Holds the address of the next instruction to be executed
   Modified by branch and jump instructions

© C. D. Cantrell (10/1999)

MIPS SPECIAL-PURPOSE REGISTERS (2)

• Floating-point registers (located in coprocessor 1, the FPU)
   Used as a local "scratch pad" by the FPU
   In the MIPS R2000 ISA, there are 32 32-bit floating-point registers
    ($f0–$f31)
• BadVaddr register (coprocessor 0, register 8)
   Memory address at which an addressing exception occurred
• Status register (coprocessor 0, register 12)
   Interrupt mask and interrupt enable bits
   Kernel/user bits for old, previous and current processes
• Cause register (coprocessor 0, register 13)
   Holds a code for the cause of an exception
• Exception program counter (EPC) (coprocessor 0, register 14)
   Holds address of instruction that caused an exception

© C. D. Cantrell (10/1999)

MIPS JUMP and BRANCH INSTRUCTIONS (1)

• Control transfer instructions determine program flow by altering the
  contents of the program counter (PC) register
   Jump register (jr Rs) loads the PC with the contents of Rs

      | 0 | Rs | 0 | 0 | 0 | 0x8 |
        6    5   5   5   5    6    (field widths in bits)

    ◦ Used to return from function calls
   Jump (j Address) loads the PC with the 26-bit contents of Address,
    multiplied by 4 to give a 28-bit byte address in the low 268,435,456
    bytes (= 256 MB) of main memory

      | 0x2 | Address |
         6      26      (field widths in bits)

© C. D. Cantrell (01/1999)
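The 256 MB figure follows directly from the field widths (Python cross-check; the sample address field 0x100000 is an arbitrary value):

```python
# A 26-bit word address, shifted left 2 bits, is a 28-bit byte address,
# so a jump can reach the low 256 MB of memory
assert (2**26) * 4 == 268_435_456 == 2**28

# Example: an address field of 0x100000 names byte address 0x400000
assert 0x100000 << 2 == 0x400000
```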

MIPS JUMP and BRANCH INSTRUCTIONS (2)

• Relative addressing is used to implement conditional transfers
   Branch instructions are conditional (transfer control only if certain
    conditions are met, as in "look before you leap")
    ◦ Used for loop control
   Example: branch on equal (beq Rs, Rt, Target)
    ◦ The branch to the instruction at label Target is taken when the
      contents of Rs = the contents of Rt
    ◦ In the assembled instruction,
      Offset = (1/4)(address of Target − value in PC when beq executes)

      | 0x4 | Rs | Rt | Offset |
         6     5    5     16     (field widths in bits)

© C. D. Cantrell (01/1999)

MIPS JUMP and BRANCH INSTRUCTIONS (3)

• Example of a conditional branch in program init1d.s

  [0x00400028] 0x01890018  mult $12, $9       # top of loop
  [0x0040002c] 0x00007012  mflo $14           # offset = index*size
  [0x00400030] 0x01c86820  add $13, $14, $8   # addr = base + offset
  [0x00400034] 0xadb80000  sw $24, 0($13)     # value -> address
  [0x00400038] 0x218c0001  addi $12, $12, 1   # index is in $12
  [0x0040003c] 0x016c7822  sub $15, $11, $12  # max. array index is in $11
  [0x00400040] 0x05e1fffa  bgez $15 -24       # loop again if index <= maximum

   Arithmetic of PC-relative addressing:
    ◦ The PC when bgez starts to execute contains 0x00400040; the offset is
      shown as −24 (decimal) = −0x18 (bytes, not instructions)
    ◦ In the assembled bgez, the offset field is 0xfffa = −6 instructions
    ◦ The branch target is at address 0x00400028; let's check the arithmetic:
      0x00400040 (branch) − 0x18 (offset) = 0x00400028 (target)  (checks)

© C. D. Cantrell (10/1999)
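The offset arithmetic can be replayed mechanically (Python sketch, illustrative only):

```python
def sign_extend16(x):
    """Interpret a 16-bit field as a two's complement integer."""
    return x - 0x10000 if x & 0x8000 else x

# The bgez at 0x00400040 carries the offset field 0xfffa
offset_field = 0x05E1FFFA & 0xFFFF
assert sign_extend16(offset_field) == -6          # -6 instructions
assert sign_extend16(offset_field) * 4 == -24     # = -0x18 bytes

# SPIM displays the target as branch address minus 0x18
assert 0x00400040 - 0x18 == 0x00400028
```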

LOOPS (1)

• A loop is a block of statements or instructions that may be executed
  more than once
   A fundamental programming structure
    ◦ Present in all programming languages
    ◦ Necessary for all matrix-vector (linear algebra) algorithms
    ◦ Necessary in all differential-equation solvers
   Loop control statements or instructions
    ◦ Loop initialization
    ◦ Incrementing the induction variable (if there is one)
    ◦ Exiting from the loop

© C. D. Cantrell (01/1999)

LOOPS (2)

• Loop syntax in C:
   For loop:
      for ( init; test; update ) { statements }
   While loop:
      while ( expression ) { statements }

© C. D. Cantrell (01/1999)

LOOPS (3)

• Loops in assembly languages:
   Initialization before loop
    ◦ Set a counter to control the number of times the loop is executed
   Inside the loop:
    ◦ ALU or floating-point instructions
    ◦ Decrement or increment the counter
   Test and branch instruction
    ◦ If yes, then transfer control to the top of the loop
    ◦ If no, then fall through the bottom of the loop
    ◦ SPIM/SAL instructions:
      blt, beq, bgt compare two numbers and branch if "true"
      bltz, beqz, bgtz compare a number to 0 and branch if "true"
      If ($s0) < ($s1), slt $t0,$s0,$s1 sets ($t0) to 1, and a following
      bne $t0,$0,Target then branches to Target

© C. D. Cantrell (10/1999)

LOOPS (4)

• Example of loop programming in SAL:

# SAL code to initialize the elements of a 1-d array
# with the value stored in location val
        .data
ar1:    .word 0:20    # array of 20 integers (4 bytes each)
val:    .float 1.e-3  # value to store in array elements
base:   .word 0       # will hold address of ar1[0]
index:  .word 0       # array index; initialized here to 0
nmax:   .word 19      # no. of array elements - 1; max. value of array index
size:   .word 4       # size of an array element, in bytes
addr:   .word 0       # absolute address of ar1[index]
offset: .word 0       # offset from first array element
n:      .word
#
        .text
__start: la base,ar1           # get absolute address of ar1[0]
                               # addresses of other array elements
                               # will be computed as base + offset
loop:   mul offset,index,size  # loop begins here
        add addr,offset,base   # compute absolute address of ar1[index]
        move M[addr],val       # store val in ar1[index]
        add index,index,1      # increment array index
        sub n,nmax,index       # subtract array index from nmax
        bgez n, loop           # branch back to loop if diff >= 0
        done
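The same base + index*size address computation can be mirrored in a higher-level language. A sketch (Python; the stand-in value 7 and big-endian byte order are arbitrary choices for illustration):

```python
# Byte-addressed "memory" big enough for 20 four-byte elements
memory = bytearray(20 * 4)
base, size, nmax, val = 0, 4, 19, 7

index = 0
while True:
    offset = index * size                              # mul offset,index,size
    addr = base + offset                               # add addr,offset,base
    memory[addr:addr + size] = val.to_bytes(4, "big")  # move M[addr],val
    index += 1                                         # add index,index,1
    if nmax - index < 0:                               # bgez n, loop
        break

# All 20 word-aligned elements now hold val
assert all(int.from_bytes(memory[i:i+4], "big") == 7 for i in range(0, 80, 4))
```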

© C. D. Cantrell (01/1999)

# init1d.s
# SPIM code to initialize the elements of a 1-d array
# with the value stored in location val

        .data
ar1:    .word 0:20    # array of 20 integers (4 bytes each)
val:    .float 1.e-3  # value to store in array elements
nelems: .word 20      # no. of array elements
size:   .word 4       # size of an array element, in bytes

# Register usage
# For clarity, each register holds only one variable
# Nobody would program in this way, because some registers can be re-used
#
# $t0   base address (address of ar1[0])
# $t1   size = size of an array element (in bytes)
# $t2   nelems = number of array elements
# $t3   nmax = nelems - 1 = max. value of array index
# $t4   index = array index of current element
# $t5   addr = absolute address of ar1[index]
# $t6   offset = index * size
# $t7   n = counter to be decremented
# $t8   val = value to be stored in each array element

        .text
__start: la $t0,ar1        # get absolute address of ar1[0]
                           # addresses of other array elements
                           # will be computed as base + offset
        lw   $t1,size
        lw   $t2,nelems
        addi $t3,$t2,-1    # calculate nmax = nelems - 1
        ori  $t4, $0, 0    # initialize index to 0
        lwc1 $f0, val
        mfc1 $t8, $0
loop:   mul  $t6,$t4,$t1   # loop begins here; offset in bytes
        add  $t5,$t6,$t0   # compute absolute address of ar1[index]
        sw   $t8,0($t5)    # store val in ar1[index]
        addi $t4,$t4,1     # increment array index
        sub  $t7,$t3,$t4   # subtract array index from (nelems - 1)
        bgez $t7, loop     # branch back to loop if diff >= 0
        ori  $v0,$0,10
        syscall

ASSEMBLED TEXT FOR init1d.s

[0x00400000] 0x3c081001  lui $8, 4097       # la $t0,ar1
[0x00400004] 0x3c011001  lui $1, 4097
[0x00400008] 0x8c290058  lw $9, 88($1)      # lw $t1,size
[0x0040000c] 0x3c011001  lui $1, 4097
[0x00400010] 0x8c2a0054  lw $10, 84($1)     # lw $t2,nelems
[0x00400014] 0x214bffff  addi $11, $10, -1  # addi $t3,$t2,-1
[0x00400018] 0x340c0000  ori $12, $0, 0     # ori $t4, $0, 0
[0x0040001c] 0x3c011001  lui $1, 4097
[0x00400020] 0xc4200050  lwc1 $f0, 80($1)   # lwc1 $f0, val
[0x00400024] 0x44180000  mfc1 $24, $0       # mfc1 $t8, $0
[0x00400028] 0x01890018  mult $12, $9
[0x0040002c] 0x00007012  mflo $14           # mul $t6,$t4,$t1
[0x00400030] 0x01c86820  add $13, $14, $8   # add $t5,$t6,$t0
[0x00400034] 0xadb80000  sw $24, 0($13)     # sw $t8,0($t5)
[0x00400038] 0x218c0001  addi $12, $12, 1   # addi $t4,$t4,1
[0x0040003c] 0x016c7822  sub $15, $11, $12  # sub $t7,$t3,$t4
[0x00400040] 0x05e1fffa  bgez $15 -24       # bgez $t7, loop
[0x00400044] 0x3402000a  ori $2, $0, 10     # ori $v0,$0,10
[0x00400048] 0x0000000c  syscall            # syscall

# init2d.s
# SPIM code to initialize the elements of a 2-d array
# with the value stored in location val
        .data
ar2:    .word 0:12    # array of 12 integers (4 bytes each)
val:    .float 1.e-3  # value to store in array elements
nrows:  .word 4       # no. of rows
ncols:  .word 3       # no. of columns
size:   .word 4       # size of an array element, in bytes

# Register usage
# For clarity, each register holds only one variable
# Nobody would program in this way, because some registers can be re-used
#
# $t0   base address (address of ar2[0][0])
# $t1   size = size of an array element (in bytes)
# $t2   nrows = number of rows
# $t3   ncols = number of columns
# $t4   nrmax = nrows - 1 = max. value of row index
# $t5   ncmax = ncols - 1 = max. value of column index
# $t6   rindex = row index
# $t7   cindex = column index
# $t8   addr = absolute address of ar2[rindex][cindex]
# $t9   roffset = rindex * ncols * size
# $s0   coffset = cindex * size
# $s1   offset = roffset + coffset
# $s2   nc = column counter to be decremented
# $s3   nr = row counter to be decremented
# $s4   value to be stored in each array element

        .text
__start: la $t0, ar2
        lw   $t1, size
        lw   $t2, nrows
        lw   $t3, ncols
        addi $t4, $t2, -1  # nrmax
        addi $t5, $t3, -1  # ncmax
        ori  $t6, $0, 0    # initialize row index to 0
        lwc1 $f0, val
        mfc1 $s4, $0
rloop:  mul  $t9, $t6, $t3 # multiply rindex by ncols
        mul  $t9, $t9, $t1 # multiply by size of one array element to get roffset
        ori  $t7, $0, 0    # initialize column index to 0
cloop:  mul  $s0, $t7, $t1 # multiply cindex by size to get coffset
        add  $s1, $s0, $t9 # offset of ar2[rindex][cindex] = roffset + coffset
        add  $t8, $s1, $t0 # address of ar2[rindex][cindex] = offset + base
        sw   $s4, 0($t8)   # store val in ar2[rindex][cindex]
        addi $t7, $t7, 1   # increment the column index
        sub  $s2, $t5, $t7 # nc = ncmax - cindex
        bgez $s2, cloop    # branch back to cloop if nc >= 0
        addi $t6, $t6, 1   # increment the row index
        sub  $s3, $t4, $t6 # nr = nrmax - rindex
        bgez $s3, rloop    # branch back to rloop if nr >= 0
        ori  $v0, $0, 10   # reach here if row loop is done
        syscall            # end of program!

ASSEMBLED TEXT FOR init2d.s

[0x00400000] 0x3c081001  lui $8, 4097       # la $t0, ar2
[0x00400004] 0x3c011001  lui $1, 4097
[0x00400008] 0x8c29003c  lw $9, 60($1)      # lw $t1, size
[0x0040000c] 0x3c011001  lui $1, 4097
[0x00400010] 0x8c2a0034  lw $10, 52($1)     # lw $t2, nrows
[0x00400014] 0x3c011001  lui $1, 4097
[0x00400018] 0x8c2b0038  lw $11, 56($1)     # lw $t3, ncols
[0x0040001c] 0x214cffff  addi $12, $10, -1  # addi $t4, $t2, -1
[0x00400020] 0x216dffff  addi $13, $11, -1  # addi $t5, $t3, -1
[0x00400024] 0x340e0000  ori $14, $0, 0     # ori $t6, $0, 0
[0x00400028] 0x3c011001  lui $1, 4097
[0x0040002c] 0xc4200030  lwc1 $f0, 48($1)   # lwc1 $f0, val
[0x00400030] 0x44140000  mfc1 $20, $0       # mfc1 $s4, $0
[0x00400034] 0x01cb0018  mult $14, $11
[0x00400038] 0x0000c812  mflo $25           # mul $t9, $t6, $t3
[0x0040003c] 0x03290018  mult $25, $9
[0x00400040] 0x0000c812  mflo $25           # mul $t9, $t9, $t1
[0x00400044] 0x340f0000  ori $15, $0, 0     # ori $t7, $0, 0
[0x00400048] 0x01e90018  mult $15, $9
[0x0040004c] 0x00008012  mflo $16           # mul $s0, $t7, $t1
[0x00400050] 0x02198820  add $17, $16, $25  # add $s1, $s0, $t9
[0x00400054] 0x0228c020  add $24, $17, $8   # add $t8, $s1, $t0
[0x00400058] 0xaf140000  sw $20, 0($24)     # sw $s4, 0($t8)
[0x0040005c] 0x21ef0001  addi $15, $15, 1   # addi $t7, $t7, 1
[0x00400060] 0x01af9022  sub $18, $13, $15  # sub $s2, $t5, $t7
[0x00400064] 0x0641fff9  bgez $18 -28       # bgez $s2, cloop
[0x00400068] 0x21ce0001  addi $14, $14, 1   # addi $t6, $t6, 1
[0x0040006c] 0x018e9822  sub $19, $12, $14  # sub $s3, $t4, $t6
[0x00400070] 0x0661fff1  bgez $19 -60       # bgez $s3, rloop
[0x00400074] 0x3402000a  ori $2, $0, 10     # ori $v0, $0, 10
[0x00400078] 0x0000000c  syscall            # syscall

SHIFTING (1)

• Let u represent an unsigned, n-bit binary integer,
      u = b(n-1) ··· b(0),   b(i) = coefficient of 2^i
• Multiply u by a positive power of 2:
      2^k · u = [b(n-1) ··· b(n-k)] [b(n-k-1) ··· b(0)] [0 0 ··· 0]
                 (overflow bits)     (shifted left by k)  (k zeros)
   Accomplished by the SPIM instruction sll $rd,$rs,sa, where
    ◦ sa is the number of places to shift left
    ◦ $rs is the source
    ◦ $rd is the destination
   Example: In a jump instruction, a 26-bit offset (in instructions) is
    shifted left 2 bits to make a byte offset

© C. D. Cantrell (10/1999)
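Since a left shift by k multiplies by 2^k, the claim is easy to cross-check (Python; the operand values are arbitrary examples):

```python
u, k = 5, 3
# Bits shifted past bit 31 are lost in a 32-bit register, hence the mask
assert (u << k) & 0xFFFFFFFF == u * 2**k == 40

# The jump-instruction example: a word offset shifted left 2 bits
# becomes a byte offset (multiplication by 4)
assert (6 << 2) == 6 * 4 == 24
```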

SHIFTING (2)

• Let u represent an unsigned, n-bit binary integer,
      u = b(n-1) ··· b(0),   b(i) = coefficient of 2^i
• Multiply u by a negative power of 2 & take the integer part:
      floor(2^(-k) · u) = [0 0 ··· 0] [b(n-1) ··· b(k)]
                           (k zeros)   (shifted right by k)
   Bits b(k-1) ··· b(0) are shifted out
   Accomplished by the SPIM instruction "shift right logical":
      srl $rd,$rs,sa
    ◦ sa is the number of places to shift right
    ◦ $rs is the source
    ◦ $rd is the destination
   The k places vacated on the left are filled with zeros

© C. D. Cantrell (10/1999)

SHIFTING (3)

• Let the two’s complement, n-bit binary integer  ···  n s = bn−1bn−2 b0 = bn−12 + i represent the integer bn−1 i =(−1) bn−2 ···b0 • Multiply i by a negative power of 2 & take integer part:

 bn−1 −k result = i = (−1) 2 bn−2 ···b0 • The two’s complement integer that represents i is  n  s = bn−12 + i which has k leading 0’s if i>0, and k leading 1’s if i<0

c C. D. Cantrell (01/1999) THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

SHIFTING (4)

• Example of shifting a two's complement representation of an integer
  when n = 8:
   i = −125 (decimal) = −7D (hex) = −0111 1101 (binary)
   The two's complement representative of i is 1000 0010 + 1 = 1000 0011
   Multiply by 2^(−k), where k = 3:
      −125/8 = −15 5/8  ⇒  floor(−125/8) = −16 = −10 (hex) = −0001 0000
   The two's complement representative of −16 is 1110 1111 + 1 = 1111 0000,
    which is the result of shifting 1000 0011 right by 3 places and filling
    the 3 empty places on the left with 1's

© C. D. Cantrell (01/1999)
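This worked example can be checked mechanically; note that Python's >> on a negative integer is already an arithmetic shift (illustrative sketch):

```python
def sra8(bits, k):
    """Arithmetic right shift of an 8-bit two's complement pattern."""
    value = bits - 0x100 if bits & 0x80 else bits  # recover the signed value
    return (value >> k) & 0xFF                     # Python's >> sign-extends

# i = -125 is represented as 1000 0011
assert (-125) & 0xFF == 0b10000011

# Shifting right 3 places with sign fill gives 1111 0000, i.e. -16
assert sra8(0b10000011, 3) == 0b11110000
assert (-16) & 0xFF == 0b11110000

# And floor(-125 / 8) is indeed -16
assert -125 >> 3 == -16
```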

SHIFTING (5)

• Let u represent an two’s complement, n-bit binary integer,

u = bn−1 ···b0 i n−1 bi = coefficient of 2 for i ∈ (0 : n − 2),bn−1 = coefficient of − 2 • Multiply u by a negative power of 2 & take the integer part:  −k  ··· ··· 2 u = bn−1bn−1 bn−1 bn−1 bn−k k bits equal to bn−1 shifted by k

 Bits bn−k−1 ···b0 are shifted out  Accomplished by the SPIM instruction “shift right arithmetic”: sra $rd,$rs,sa where sa is the number of places to shift right $rs is the source $rd is the destination

 The k places vacated on the left are filled with bits equal to bn−1 (sign extension) c C. D. Cantrell (10/1999) THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

SHIFTING (6)

• Shift operators in C:
      destination = number shiftop n
  where number is an integer, n is the number of places to shift
  (in base 2), and
      shiftop = >> for a right shift, << for a left shift
   A C left shift is equivalent to MIPS sll
   A C right shift is equivalent to MIPS srl if number is unsigned or
    is positive
   For other integer or logical data, the result of a C right shift is
    implementation-dependent
    ◦ May be equivalent to MIPS sra

© C. D. Cantrell (10/1999)

LOGICAL OPERATIONS ON WORDS (1)

• Form of SPIM logical instructions:
      op destination, mask, source
  where op is and, or, nor or xor
• Effects of logical operations on each bit:

      mask bit →   0       1
      AND          clear   copy
      OR           copy    set
      NOR          toggle  clear
      XOR          copy    toggle

   Clear means "set destination bit equal to 0"
   Set means "set destination bit equal to 1"
   Toggle means "set destination bit equal to 1 if source bit is 0, and
    set destination bit equal to 0 if source bit is 1" (logical negation)

© C. D. Cantrell (01/1999)

LOGICAL OPERATIONS ON WORDS (2)

• Example of AND:
          1001 0011  (source)
      AND 1111 0000  (mask)
          1001 0000  (destination)
   The effect is to copy the high nibble and clear the low nibble
• Example of OR:
          1001 0000  (source)
       OR 0000 0011  (mask)
          1001 0011  (destination)
   The effect is to set bits 0 and 1 and copy the remaining bits

© C. D. Cantrell (01/1999)

LOGICAL OPERATIONS ON WORDS (3)

• Example of NOR:
          1001 0011  (source)
      NOR 1111 0000  (mask)
          0000 1100  (destination)
   The effect is to clear the high nibble and toggle the low nibble
• Example of XOR:
          1001 0011  (source)
      XOR 1111 0000  (mask)
          0110 0011  (destination)
   The effect is to toggle the high nibble and copy the low nibble

© C. D. Cantrell (01/1999)
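All four nibble examples above can be verified in one pass (Python; the 0xFF mask models an 8-bit register so that NOR's complement stays in 8 bits):

```python
src, mask = 0b10010011, 0b11110000

assert src & mask == 0b10010000                # AND: copy high nibble, clear low
assert 0b10010000 | 0b00000011 == 0b10010011   # OR: set bits 0 and 1, copy rest
assert ~(src | mask) & 0xFF == 0b00001100      # NOR: clear high, toggle low
assert src ^ mask == 0b01100011                # XOR: toggle high, copy low
```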

LOGICAL OPERATIONS ON WORDS (4)

• Use AND to pick out the fraction of a single-precision number
  (bit 31 = sign, bits 30-23 = exponent, bits 22-0 = fraction):

      source:      0100 0110 1010 1010 1010 1010 1010 1010  = 0x46aaaaaa
      mask:        0000 0000 0111 1111 1111 1111 1111 1111  = 0x007fffff
      destination: 0000 0000 0010 1010 1010 1010 1010 1010  = 0x002aaaaa

© C. D. Cantrell (01/1999)

LOGICAL OPERATIONS ON WORDS (5)

Previous slide: 0x46aaaaaa AND 0x007fffff = 0x002aaaaa
 The effect is to extract the fraction as an unsigned integer (without the implicit 1 bit)
AND can also be used to extract the sign bit:
 0xc6aaaaaa AND 0x80000000 = 0x80000000
(By the way — 0x46aaaaaa represents 21,845.33, and 0xc6aaaaaa represents −21,845.33)

c C. D. Cantrell (01/1999)

LOGICAL OPERATIONS ON WORDS (6)

C syntax: destination = mask op source
 where op is: & (AND), | (OR), ^ (XOR); unary ~ is NOT
 Example: number = 0x80000000 | number sets the sign bit of number
destination = ~source puts the bitwise logical negation of source into destination
 Example: ~1 has all bits set except for bit 0

c C. D. Cantrell (01/1999)

DATA STRUCTURES (1)

Array: A set of data blocks (bytes, words, arrays, ...) of equal size, stored contiguously in memory
 Memory is an array of bytes or words (depending upon access mode)
 In SAL:
◦ To refer to the byte located at address, use m[address]
◦ To refer to the word starting at address, use M[address] (where address is a multiple of 4)
 The general-purpose registers are an array (the register file)
◦ Registers can be addressed by number ($0–$31); for example, add $8, $9, $10
 The floating-point registers are also an array
◦ Must be addressed in pairs ($f0, $f2, $f4, ..., $f30)

c C. D. Cantrell (02/1999)

DATA STRUCTURES (2)

Arrays are built into higher-level languages
 In FORTRAN:
◦ To declare a one-dimensional array: real*4 f(2000)
◦ To store a value in an array element: f(i) = 1.e-3 * float(i)
 In C:
◦ To declare a one-dimensional array: float f[2000];
◦ To declare a pointer array: float *fp[2000];
◦ To store a value in an array element: f[i] = 1.e-3 * ((float) i);

c C. D. Cantrell (02/1999)

DATA STRUCTURES (3)

Uses of arrays in EE & physics
 Sampled values of a function
◦ One dimension: f1[i] = f1(x_i)
◦ Two dimensions: f2[i][j] = f2(x_i, y_j) or matrix element f_{i,j}
◦ Three dimensions: f3[i][j][k] = f3(x_i, y_j, z_k)
◦ Four dimensions: f4[i][j][k][l] = f4(x_i, y_j, z_k, t_l)
 Strings
 Pointer arrays

c C. D. Cantrell (02/1999)

DATA STRUCTURES (4)

• Assemblers don’t know about arrays
 Reserve space for an array of 2000 integers in SAL/SPIM: ar: .word 0:2000
 Load starting address of array ar into register $v0: la $v0, ar
 To access an array element, must give its address as base + offset
◦ Computation of the address of ar[i] by the C compiler:
  address of ar[i] = address of ar[0] + sizeof(ar[0]) * i
                     (base)             (offset)
◦ Computation of the address of ar(i) by the FORTRAN compiler:
  address of ar(i) = address of ar(1) + sizeof(ar(1)) * (i-1)
◦ sizeof(ar(1)) = number of bytes occupied by the element ar(1)

c C. D. Cantrell (02/1999)

DATA STRUCTURES (5)

Example of storing a number in an array element in SAL:

faddr:  .word        # pointer to f[0]
i:      .word        # array index
iaddr:  .word        # pointer to f[i]
offset: .word        # offset of f[i] from f[0], in bytes
f:      .word 0:80   # 80 = no. of array elements
step:   .word 5
x:      .word
        .
        .
        mul x, step, i           # x = step * i;
        la faddr, f              # get address of f[0]
        mul offset, i, 4         # compute offset of f[i] in BYTES (4 bytes/word)!
        add iaddr, offset, faddr # compute absolute address of f[i]
        move M[iaddr], x         # f[i] = step * i;

c C. D. Cantrell (02/1999)

# SPIM code to initialize the elements of a 1-d array
# with the value stored in location val

.data
ar1:    .word 0:20    # array of 20 integers (4 bytes each)
val:    .float 1.e-3  # value to store in array elements
nelems: .word 20      # no. of array elements
size:   .word 4       # size of an array element, in bytes

# Register usage
# For clarity, each register holds only one variable
# Nobody would program in this way, because some registers can be re-used
#
# $t0  base address (address of ar1[0])
# $t1  size = size of an array element (in bytes)
# $t2  nelems = number of array elements
# $t3  nmax = nelems - 1 = max. value of array index
# $t4  index = array index of current element
# $t5  addr = absolute address of ar1[index]
# $t6  offset = index * size
# $t7  n = counter to be decremented
# $t8  val = value to be stored in each array element

.text
__start: la $t0,ar1        # get absolute address of ar1[0]
                           # addresses of other array elements
                           # will be computed as base + offset
        lw $t1,size
        lw $t2,nelems
        addi $t3,$t2,-1    # calculate nmax = nelems - 1
        ori $t4, $0, 0     # initialize index to 0
        lwc1 $f0, val
        mfc1 $t8, $0
loop:   mul $t6,$t4,$t1    # loop begins here; offset in bytes
        add $t5,$t6,$t0    # compute absolute address of ar1[index]
        sw $t8,0($t5)      # store val in ar1[index]
        addi $t4,$t4,1     # increment array index
        sub $t7,$t3,$t4    # subtract array index from (nelems - 1)
        bgez $t7, loop     # branch back to loop if diff >= 0
        ori $v0,$0,10
        syscall

ASSEMBLED TEXT FOR init1d.s

[0x00400000] 0x3c081001  lui $8, 4097        # la $t0,ar1 # get absolute
[0x00400004] 0x3c011001  lui $1, 4097
[0x00400008] 0x8c290068  lw $9, 104($1)      # lw $t1,offset
[0x0040000c] 0x3c011001  lui $1, 4097
[0x00400010] 0x8c2a0058  lw $10, 88($1)      # lw $t2,index
[0x00400014] 0x3c011001  lui $1, 4097
[0x00400018] 0x8c2b0060  lw $11, 96($1)      # lw $t3,size
[0x0040001c] 0x3c011001  lui $1, 4097
[0x00400020] 0x8c2c0054  lw $12, 84($1)      # lw $t4,base
[0x00400024] 0x3c011001  lui $1, 4097
[0x00400028] 0x8c2e0050  lw $14, 80($1)      # lw $t6,val
[0x0040002c] 0x3c011001  lui $1, 4097
[0x00400030] 0x8c2f005c  lw $15, 92($1)      # lw $t7,nmax
[0x00400034] 0x014b0018  mult $10, $11
[0x00400038] 0x00004812  mflo $9             # mul $t1,$t2,$t3 # loop begins
[0x0040003c] 0x012c6820  add $13, $9, $12    # add $t5,$t1,$t4 # co
[0x00400040] 0xadae0000  sw $14, 0($13)      # sw $t6,0($t5) # store val
[0x00400044] 0x214a0001  addi $10, $10, 1    # addi $t2,$t2,1 # increme
[0x00400048] 0x01eac022  sub $24, $15, $10   # sub $t8,$t7,$t2 # subt
[0x0040004c] 0x0701fffa  bgez $24 -24        # bgez $t8, loop # branch back
[0x00400050] 0x3402000a  ori $2, $0, 10      # ori $v0,$0,10
[0x00400054] 0x0000000c  syscall             # syscall

DATA STRUCTURES (6)

Two-dimensional arrays
 In C: Row-major storage order
  address of ar2[i][j] = ar2 + data_size * no_columns * i + data_size * j
  where ar2 = address of ar2[0][0]
 In FORTRAN: Column-major storage order
  address of ar2(i,j) = ar2 + data_size * no_rows * (j-1) + data_size * (i-1)
  where ar2 = address of ar2(1,1)

c C. D. Cantrell (02/1999)

ARRAY STORAGE ORDER IN C (2-D)

float ar2[4][3];

ar2[0][0] ar2[0][1] ar2[0][2]

ar2[1][0] ar2[1][1] ar2[1][2]

ar2[2][0] ar2[2][1] ar2[2][2]

ar2[3][0] ar2[3][1] ar2[3][2]

c C. D. Cantrell (02/1999)

ARRAY STORAGE ORDER IN FORTRAN (2-D)

DIMENSION AR2(4,3)

ar2(1,1) ar2(1,2) ar2(1,3)
ar2(2,1) ar2(2,2) ar2(2,3)
ar2(3,1) ar2(3,2) ar2(3,3)
ar2(4,1) ar2(4,2) ar2(4,3)

c C. D. Cantrell (02/1999)

# SAL code to initialize the elements of a 4X3 2-d array
# with the value stored in location val, using index computation

.data
ar2:     .word 0:12   # array of 12 integers (4 bytes each)
val:     .float 1.e-3 # value to store in array elements
base:    .word        # will hold address of ar2[0][0]
rindex:  .word 0      # row index; initialized here to 0
cindex:  .word 0      # column index; initialized here to 0
nrows:   .word 4      # no. of rows
ncols:   .word 3      # no. of columns
nrmax:   .word 0
ncmax:   .word 0
size:    .word 4      # size of an array element, in bytes
addr:    .word        # absolute address of ar2[rindex][cindex]
roffset: .word        # roffset = rindex * ncols * size
coffset: .word        # coffset = cindex * size
offset:  .word        # offset = roffset + coffset
nc:      .word        # column counter to be decremented
nr:      .word        # row counter to be decremented

.text
__start: nop
        la base,ar2              # get absolute address of ar2[0][0]
        sub nrmax,nrows,1        # no. of rows - 1; max. value of row index
        sub ncmax,ncols,1        # max. value of column index
rloop:  mul roffset,rindex,ncols # row loop begins here
        mul roffset,roffset,size # row offset
        sub cindex,cindex,cindex # re-initialize column index to 0
cloop:  mul coffset,cindex,size  # column offset
        add offset,roffset,coffset # offset of ar2[rindex][cindex]
        add addr,offset,base     # compute address of ar2[rindex][cindex]
        move M[addr],val         # store val in ar2[rindex][cindex]
        add cindex,cindex,1      # increment column index
        sub nc,ncmax,cindex      # subtract column index from ncmax
        bgez nc,cloop            # branch back to cloop if diff >= 0
        add rindex,rindex,1      # increment row index
        sub nr,nrmax,rindex      # subtract row index from nrmax
        bgez nr,rloop            # branch back to rloop if diff >= 0
        done

# init2d.s
# SPIM code to initialize the elements of a 2-d array
# with the value stored in location val

.data
ar2:    .word 0:12    # array of 12 integers (4 bytes each)
val:    .float 1.e-3  # value to store in array elements
nrows:  .word 4       # no. of rows
ncols:  .word 3       # no. of columns
size:   .word 4       # size of an array element, in bytes

# Register usage
# For clarity, each register holds only one variable
# Nobody would program in this way, because some registers can be re-used
#
# $t0  base address (address of ar2[0][0])
# $t1  size = size of an array element (in bytes)
# $t2  nrows = number of rows
# $t3  ncols = number of columns
# $t4  nrmax = nrows - 1 = max. value of row index
# $t5  ncmax = ncols - 1 = max. value of column index
# $t6  rindex = row index
# $t7  cindex = column index
# $t8  addr = absolute address of ar2[rindex][cindex]
# $t9  roffset = rindex * ncols * size
# $s0  coffset = cindex * size
# $s1  offset = roffset + coffset
# $s2  nc = column counter to be decremented
# $s3  nr = row counter to be decremented
# $s4  value to be stored in each array element

.text
__start: la $t0, ar2
        lw $t1, size
        lw $t2, nrows
        lw $t3, ncols
        addi $t4, $t2, -1  # nrmax
        addi $t5, $t3, -1  # ncmax
        ori $t6, $0, 0     # initialize row index to 0
        lwc1 $f0, val
        mfc1 $s4, $0
rloop:  mul $t9, $t6, $t3  # multiply rindex by ncols
        mul $t9, $t9, $t1  # multiply by size of one array element to get roffset
        ori $t7, $0, 0     # initialize column index to 0
cloop:  mul $s0, $t7, $t1  # multiply cindex by size to get coffset
        add $s1, $s0, $t9  # offset of ar2[rindex][cindex] = roffset + coffset
        add $t8, $s1, $t0  # address of ar2[rindex][cindex] = offset + base
        sw $s4, 0($t8)     # store val in ar2[rindex][cindex]
        addi $t7, $t7, 1   # increment the column index
        sub $s2, $t5, $t7  # nc = ncmax - cindex
        bgez $s2, cloop    # branch back to cloop if nc >= 0
        addi $t6, $t6, 1   # increment the row index
        sub $s3, $t4, $t6  # nr = nrmax - rindex
        bgez $s3, rloop    # branch back to rloop if nr >= 0
        ori $v0, $0, 10    # reach here if row loop is done
        syscall            # end of program!

ASSEMBLED TEXT FOR init2d.s

[0x00400000] 0x3c081001  lui $8, 4097       # la $t0, ar2
[0x00400004] 0x3c011001  lui $1, 4097
[0x00400008] 0x8c29003c  lw $9, 60($1)      # lw $t1, size
[0x0040000c] 0x3c011001  lui $1, 4097
[0x00400010] 0x8c2a0034  lw $10, 52($1)     # lw $t2, nrows
[0x00400014] 0x3c011001  lui $1, 4097
[0x00400018] 0x8c2b0038  lw $11, 56($1)     # lw $t3, ncols
[0x0040001c] 0x214cffff  addi $12, $10, -1  # addi $t4, $t2, -1
[0x00400020] 0x216dffff  addi $13, $11, -1  # addi $t5, $t3, -1
[0x00400024] 0x340e0000  ori $14, $0, 0     # ori $t6, $0, 0
[0x00400028] 0x3c011001  lui $1, 4097
[0x0040002c] 0xc4200030  lwc1 $f0, 48($1)   # lwc1 $f0, val
[0x00400030] 0x44140000  mfc1 $20, $0       # mfc1 $s4, $0
[0x00400034] 0x01cb0018  mult $14, $11
[0x00400038] 0x0000c812  mflo $25           # mul $t9, $t6, $t3
[0x0040003c] 0x03290018  mult $25, $9
[0x00400040] 0x0000c812  mflo $25           # mul $t9, $t9, $t1
[0x00400044] 0x340f0000  ori $15, $0, 0     # ori $t7, $0, 0
[0x00400048] 0x01e90018  mult $15, $9
[0x0040004c] 0x00008012  mflo $16           # mul $s0, $t7, $t1
[0x00400050] 0x02198820  add $17, $16, $25  # add $s1, $s0, $t9
[0x00400054] 0x0228c020  add $24, $17, $8   # add $t8, $s1, $t0
[0x00400058] 0xaf140000  sw $20, 0($24)     # sw $s4, 0($t8)
[0x0040005c] 0x21ef0001  addi $15, $15, 1   # addi $t7, $t7, 1
[0x00400060] 0x01af9022  sub $18, $13, $15  # sub $s2, $t5, $t7
[0x00400064] 0x0641fff9  bgez $18 -28       # bgez $s2, cloop
[0x00400068] 0x21ce0001  addi $14, $14, 1   # addi $t6, $t6, 1
[0x0040006c] 0x018e9822  sub $19, $12, $14  # sub $s3, $t4, $t6
[0x00400070] 0x0661fff1  bgez $19 -60       # bgez $s3, rloop
[0x00400074] 0x3402000a  ori $2, $0, 10     # ori $v0, $0, 10
[0x00400078] 0x0000000c  syscall            # syscall

DATA STRUCTURES (7)

A pointer array is an array whose elements are pointers
 The memory locations that are pointed to don’t have to be contiguous
 A pointer array par2[] can be used to access the rows of a 2-dimensional array ar2[][]
  address of ar2[i][j] = par2[i] + size * j
  where par2[i] = address of ar2[i][0] = address of row i
 A pointer-based array program uses more memory than a program that does index computation
 Pointers result in a lower instruction count
◦ To access ar2[i][0], ..., ar2[i][ncols-1], increment the pointer by data_size at each step (avoids multiplication!)

c C. D. Cantrell (02/1999)

# SAL code to initialize the elements of a 4X3 2-d array
# with the value stored in location val, using pointers

.data
ar2:    .word 0:12   # array of 12 integers (4 bytes each)
val:    .float 1.e-3 # value to store in array elements
par2:   .word        # pointer to ar2[0][0]; base address of array
prow:   .word 0      # pointer to current row
rindex: .word 0      # row index; initialized here to 0
cindex: .word 0      # column index; initialized here to 0
nrows:  .word 4      # no. of rows
ncols:  .word 3      # no. of columns
nrmax:  .word 0
ncmax:  .word 0
size:   .word 4      # size of an array element, in bytes
pelem:  .word 0      # pointer to current array element
bytes:  .word        # ncols * size; no. of bytes in 1 row
nc:     .word        # column counter to be decremented
nr:     .word        # row counter to be decremented

.text
__start: nop
        la par2,ar2          # get absolute address of ar2[0][0]
        move prow,par2       # address of 1st row = address of 1st element
        move pelem,par2      # address of 1st elem. = par2
        mul bytes,ncols,size # compute no. of bytes in 1 row
        sub nrmax,nrows,1    # max. value of row index
        sub ncmax,ncols,1    # max. value of column index
        move nr,nrmax        # initialize row counter
rloop:  move nc,ncmax        # re-initialize col. counter
        move pelem,prow      # re-initialize pointer to array element
cloop:  move M[pelem],val    # store val in ar2[pelem]
        add pelem,pelem,size # increment pointer to array element
        sub nc,nc,1          # decrement column counter
        bgez nc,cloop        # branch back to cloop if nc >= 0
        add prow,prow,bytes  # increment row pointer
        sub nr,nr,1          # decrement row counter
        bgez nr,rloop        # branch back to rloop if nr >= 0
        done

# init2dp.s
# SPIM code to initialize the elements of a 2-d array
# with the value stored in location val, using pointers

.data
ar2:    .word 0:12    # array of 12 integers (4 bytes each)
val:    .float 1.e-3  # value to store in array elements
nrows:  .word 4       # no. of rows
ncols:  .word 3       # no. of columns
size:   .word 4       # size of an array element, in bytes

# Register usage
# For clarity, each register holds only one variable
# Nobody would program in this way, because some registers can be re-used
#
# $t0  base address (address of ar2[0][0])
# $t1  size = size of an array element (in bytes)
# $t2  nrows = number of rows
# $t3  ncols = number of columns
# $t4  nrmax = nrows - 1 = max. value of row index
# $t5  ncmax = ncols - 1 = max. value of column index
# $t6  prow = pointer to 1st element in row
# $t7  pelem = pointer to current array element
#      = absolute address of ar2[rindex][cindex]
# $t8  bytes = no. of bytes in 1 row = ncols * size
# $s2  nc = column counter to be decremented
# $s3  nr = row counter to be decremented
# $s4  value to be stored in each array element

.text
__start: la $t0, ar2       # get pointer to start of array
        or $t6, $t0, $0    # initialize pointer to 1st row
        lw $t1, size
        lw $t2, nrows
        lw $t3, ncols
        mul $t8, $t3, $t1  # no. of bytes in 1 row = ncols * size
        addi $t4, $t2, -1  # nrmax
        addi $t5, $t3, -1  # ncmax
        or $s3, $t4, $0    # initialize row counter to nrmax
        lwc1 $f0, val
        mfc1 $s4, $0
rloop:  or $t7, $t6, $0    # initialize pointer to 1st element of 1st row
        or $s2, $t5, $0    # initialize nc to ncmax
cloop:  sw $s4, 0($t7)     # store val in ar2[rindex][cindex]
        add $t7, $t7, $t1  # increment the column pointer by the size of 1 element
        addi $s2, $s2, -1  # decrement nc by 1
        bgez $s2, cloop    # branch back to cloop if nc >= 0
        add $t6, $t6, $t8  # increment the row pointer
        addi $s3, $s3, -1  # decrement nr by 1
        bgez $s3, rloop    # branch back to rloop if nr >= 0
        ori $v0, $0, 10    # reach here if row loop is done
        syscall            # end of program!

ASSEMBLED TEXT FOR init2dp.s

[0x00400000] 0x3c081001  lui $8, 4097       # la $t0, ar2
[0x00400004] 0x01007025  or $14, $8, $0     # or $t6, $t0, $0
[0x00400008] 0x01c07825  or $15, $14, $0    # or $t7, $t6, $0
[0x0040000c] 0x3c011001  lui $1, 4097
[0x00400010] 0x8c29003c  lw $9, 60($1)      # lw $t1, size
[0x00400014] 0x3c011001  lui $1, 4097
[0x00400018] 0x8c2a0034  lw $10, 52($1)     # lw $t2, nrows
[0x0040001c] 0x3c011001  lui $1, 4097
[0x00400020] 0x8c2b0038  lw $11, 56($1)     # lw $t3, ncols
[0x00400024] 0x01690018  mult $11, $9
[0x00400028] 0x0000c012  mflo $24           # mul $t8, $t3, $t1
[0x0040002c] 0x214cffff  addi $12, $10, -1  # addi $t4, $t2, -1
[0x00400030] 0x216dffff  addi $13, $11, -1  # addi $t5, $t3, -1
[0x00400034] 0x01809825  or $19, $12, $0    # or $s3, $t4, $0
[0x00400038] 0x3c011001  lui $1, 4097
[0x0040003c] 0xc4200030  lwc1 $f0, 48($1)   # lwc1 $f0, val
[0x00400040] 0x44140000  mfc1 $20, $0       # mfc1 $s4, $0
[0x00400044] 0x01c07825  or $15, $14, $0    # or $t7, $t6, $0
[0x00400048] 0x01a09025  or $18, $13, $0    # or $s2, $t5, $0
[0x0040004c] 0xadf40000  sw $20, 0($15)     # sw $s4, 0($t7)
[0x00400050] 0x01e97820  add $15, $15, $9   # add $t7, $t7, $t1
[0x00400054] 0x2252ffff  addi $18, $18, -1  # addi $s2, $s2, -1
[0x00400058] 0x0641fffd  bgez $18 -12       # bgez $s2, cloop
[0x0040005c] 0x01d87020  add $14, $14, $24  # add $t6, $t6, $t8
[0x00400060] 0x2273ffff  addi $19, $19, -1  # addi $s3, $s3, -1
[0x00400064] 0x0661fff8  bgez $19 -32       # bgez $s3, rloop
[0x00400068] 0x3402000a  ori $2, $0, 10     # ori $v0, $0, 10
[0x0040006c] 0x0000000c  syscall            # syscall

DATA STRUCTURES (8)

A list is a data structure such that:
 All elements are of the same data type
 In a list with p elements, each element is labeled with one and only one of the integers 0, ..., p-1 (one can also use 1, ..., p)
 Example: A list of students enrolled in a computer architecture course
A list can be implemented as an array
 The elements of a list can be lists or other data structures
 Common operations on lists:
◦ Insert
◦ Delete
◦ Find kth element

c C. D. Cantrell (02/1999)

DATA STRUCTURES (9)

A linked list is a list in which each element contains both data and the address of the next element
 Can be implemented as a 2-dimensional array ll[][]
 Each linked-list element is a row of the array
◦ ll[i][0] contains a pointer to the next element of the linked list
◦ The data in the linked-list element is stored in ll[i][1], ..., ll[i][ncols-1] (can be elements of another data structure)
 Common operations on linked lists:
◦ Insert
◦ Delete
◦ Find kth element

c C. D. Cantrell (02/1999)

DATA STRUCTURES (10)

A stack is a list in which insertions and deletions can be performed only in one position
 The insertion/deletion position is called the top of the stack, even if the stack “grows” towards lower addresses
 An insertion is called a push; a deletion is called a pop

[Diagram: a stack shown at three moments: before word 6 is pushed (words 1–5 on the stack, TOP at the first empty slot); after word 6 is pushed (TOP moves past word 6); and after word 6 is popped (TOP returns to its original position).]

TO PUSH: 1) COPY DATA TO TOP OF STACK  2) INCREMENT STACK POINTER
TO POP:  1) DECREMENT STACK POINTER

c C. D. Cantrell (02/1999)

PROCEDURES (1)

• Why use procedures or function calls?
 To be able to re-use the same code at many points in a program
 Modularity simplifies program writing and debugging
◦ Hierarchical design — a good practice when designing any complex system, software or hardware
◦ Each module is defined by its inputs and outputs
◦ Different programmers can work on the same program simultaneously
◦ Each module can be debugged independently of the others using simulated inputs

c C. D. Cantrell (04/2000)

PROCEDURES (2)

• Steps in transferring to and from a subprogram (procedure or function)
 Save the return address
◦ The return address is the address of the instruction (in the program that calls the procedure) which is to be executed after the subprogram has finished running
 Issue the procedure call instruction
 Execute the instructions in the subprogram
 Return to the calling program
• Passing data to and from a procedure
 Data can be passed either in registers or on a stack (or both)
 Pass by value: The actual value of the data passed to, or returned from, the subprogram, is put into a register or pushed onto the stack
 Pass by name: A pointer to the data is put into a register or pushed onto the stack

c C. D. Cantrell (04/2000)

PROCEDURES (3)

• Structure of a C program including function (subprogram) calls:

void main(int argc, char *argv[], char *envp[]){
    int p, s;
    s = some_library_function(argv[2],envp[3]);
    p = func1(s,argv[1]);
}
int func1(int j, char *george){
    /* executable statements that use j and george */
    return henry;
}

 The first statement defines main as a function that returns no value
 The operating system passes 3 arguments to main: the argument count (argc), a pointer to an array of arguments (argv[]) and a pointer to an array of environmental parameters (envp[])
 A subprogram can be defined in the same source file as main, or in a different source file, or in a library of pre-compiled functions

c C. D. Cantrell (04/2000)

SUBPROGRAM CALLS IN THE MIPS ISA

• Up to 4 arguments can be passed in the registers $a0 through $a3 ($4 through $7)
• The stack may be used for arguments, return addresses and register contents
 For a subprogram that is nested to a depth of only 1, no stack is needed unless a large number of variables or pointers to arrays need to be passed
 See later slides for stack usage
• The C statement return a causes the value of a to be passed to the calling program in register $v0 ($2)
 A second value can be returned in register $v1 ($3) (not done by the return statement in C)

c C. D. Cantrell (05/2000)

PROCEDURES (4)

• Instructions for saving a return address and calling a procedure:
 SAL: The calling program issues the instructions
      la proc1_ret, ret_addr
      b proc1
ret_addr: (next instruction)
◦ Saves address of label ret_addr in location labeled proc1_ret
◦ Jumps to instruction labeled proc1
 MIPS assembly language: The calling program issues the instructions
      jal proc1
      (next instruction)
◦ Saves return address in register $31
◦ Jumps to instruction labeled proc1
◦ The instruction jalr rd rs saves the return address in register rd and jumps to the instruction whose address is in register rs

c C. D. Cantrell (04/2000)

PROCEDURES (5)

• Instructions for returning from a subprogram (function or procedure):
 MIPS assembly language: The subprogram issues the instruction jr $31
◦ Jumps to instruction whose address is in register $31
◦ In principle, any other register can be used to save the return address
◦ If subprograms are deeply nested, each return address must be saved on the stack
 SAL: The subprogram issues the instruction done
◦ Equivalent to jr $31

c C. D. Cantrell (04/2000)

PROCEDURES (6)

• Support for procedure calls in MIPS assembler:
 In general, assemblers for RISC architectures give the programmer the bare minimum of support for procedures
 Special branch instructions:
◦ jal label – Loads the value of the program counter (the address of the instruction that follows the jal instruction) into register $31 ($ra) and loads the address designated by label into the program counter
◦ jalr [rd] rs – Saves the program counter in register rd and jumps to the instruction whose address is in register rs
 If rd is omitted, it defaults to $31
◦ jr $31 – Loads the contents of register $31 ($ra) into the program counter
 The programmer must pass arguments, get the result(s) returned by the procedure, and manage the registers
 Many levels of procedure calls ⇒ a stack is the ideal data structure

c C. D. Cantrell (04/2000)

# init1d-proc.s
# MIPS assembler code to initialize the elements of a 1-d array
# with the value stored in location val, using a procedure call
#
# Please load this program into SPIM and single-step through it,
# while watching the register, text and data windows, and the console
#
# Data segment
#
.data
ar1:    .word 0:20    # array of 20 integers (4 bytes each)
val:    .float 1.e-3  # value to store in array elements
veclen: .word 20      # vector length (number of elements)
size:   .word 4       # size of an array element, in bytes

.text
__start: la $a0,ar1      # First argument: a pointer to ar1[0]
        lw $a1,size      # Second argument: value of size
        lw $a2,veclen    # Third argument: value of vector length
        lw $a3,val       # Fourth argument: value to be stored
        jal initv
        ori $v0,$0,10    # These two instructions are equivalent
        syscall          # to the SAL command "done"

initv:  nop              # Entry point to procedure initv
        blez $a2,beamup  # If veclen<=0, I'm outta here
        or $t0,$0,$a0    # Reg. t0 points to the array element
        or $t1,$0,$a2    # Reg. t1 is a counter
loop:   sw $a3,0($t0)    # Store the value into the array element
        add $t0,$a1,$t0  # Increment the pointer by the value of size
        addi $t1,-1      # Decrement the counter
        bgtz $t1, loop   # branch back to loop if counter > 0
                         # (since we store at the head of the
                         # loop, we compute one more address
                         # than necessary just to reduce
                         # the number of compares & branches)
beamup: jr $ra           # Beam me up....

Why a stack is the natural data structure for subprogram calls

main(...){
    x = A(arg1,...)
}

int A(arg1,...){
    y = B(arg2,...)
    return a;
}

int B(arg2,...){
    z = C(arg3,...)
    return b;
}

int C(arg3,...){
    return c; /* in register $2 */
}

[Diagram: as the calls nest, the stack holds the return addresses: A; then A, B; then A, B, C; the stack unwinds in reverse order as each function returns.]

c C. D. Cantrell (05/2000)

PROCEDURES (7)

• Recommended stack usage in MIPS assembler:
 .frame pseudo-operation: .frame framereg, framesize, returnreg
◦ Must observe convention that $sp + framesize = previous $sp
◦ Example: .frame $sp, 24, $31
 Adjust stack with instruction subu $sp, $sp, framesize
 Save register with instruction sw reg, framesize+frameoffset-N($sp)
◦ N = 0 for highest-numbered register saved
◦ N is incremented by 4 for each subsequent lower-numbered register saved

c C. D. Cantrell (05/1999)

MIPS STACK USAGE

[Diagram: stack frame layout, high addresses at the top. From high to low: arguments (if any) in reverse order; local and temporary variables ($fp points here); saved $ra, $sp, $fp, $gp; saved $s registers (if any); arguments for called procedures (if any), in reverse order ($sp points here). framesize spans from $sp up to the previous $sp. Reference an argument passed by the calling routine at a positive offset from the frame pointer; reference a local variable or a saved register at a negative offset from the frame pointer, or at a positive offset from the stack pointer.]

STACK ALLOCATION

[Diagram: stack allocation before, during, and after a procedure call, high addresses at the top. Before the call, the caller pushes arguments for the called procedure in reverse order. During the call, the frame grows downward to hold local and temporary variables, saved $ra, $sp, $fp, $gp, saved $s registers (if any), and arguments for procedures called by the called procedure, in reverse order; framesize spans the new frame. After the call returns, $sp and $fp are restored to their values before the call.]

FACTORIAL PROGRAM (1)

• Example of tail recursion (recursive computation on the stack)
• Modular design begins by specifying all inputs and outputs
• Recursive algorithm for the factorial function:
 Definition: n! := n(n-1) ··· 2 · 1
  (Multiply n by n-1, multiply the result by n-2, ...)
 Recursive algorithm:

    r_k = m_{k-1} r_{k-1},   m_k = m_{k-1} - 1

  where r_k is the running product and m_k is a counter
 Start the algorithm with r_0 = 1, m_0 = n; then

    r_1 = n,           m_1 = n - 1
    r_2 = n(n-1),      m_2 = n - 2
    ...
    r_n = n(n-1)···1,  m_n = n - n = 0

c C. D. Cantrell (05/1999)

FACTORIAL PROGRAM (2)

Block diagram of factorial algorithm:

[Diagram: a box labeled "fact" takes inputs m and r and produces outputs m-1 and r·m. If m-1 ≤ 0, the output r·m is n!; otherwise ("no" branch) the outputs are fed back into the inputs.]

c C. D. Cantrell (05/1999)

FACTORIAL PROGRAM (3)

Implement the recursive algorithm for n! as a procedure that calls itself
 Only 3 arguments are needed
 Arguments could be passed on stack (slow!)
 Arguments could be passed in registers (fast; easy to program)
 Would be even faster if procedure call overhead were eliminated

c C. D. Cantrell (05/1999)

# fact.s
#
# Computation of the factorial function
#     n! := n(n-1)...1
# using the example of stack usage and procedure calls in Goodman and Miller,
# p. 231
#
# This program computes the factorial function of a number n, stores the
# result, and prints the answer on the console. The key routine is called
# fact. This routine functions as a "black box" with 2 inputs (m and r)
# and 2 outputs (m-1 and mr). Suppose that m is initialized to n and that
# r is initialized to 1. Then, if m-1 > 0, the outputs are fed back into
# the inputs. If m-1 <= 0, the mr output is equal to n!
#
# The only tricky part is the use of a recursive procedure, that is,
# a procedure that calls itself. To understand what is happening
# in this program, you MUST single-step through it. Look at the
# registers and the stack as you step.
#
# Design goals and constraints:
#    1. Return the correct value of n! if n >= 0
#    2. Check whether n < 0, and, if yes, print an error message
#    3. Use tail recursion (recursion on the stack)
#
################################################################################
.data
num:     .word 5      # Arg of the factorial function
result:  .word 1      # Will hold num!
string1: .asciiz " factorial is "
string2: .asciiz "Error! Argument of factorial is negative"
################################################################################
.text
__start: lw $s4,num       # Register 20 gets loaded with the argument
                          # to the factorial function; its value will be used
                          # as a counter
        li $s2,1          # Register 18 will hold the answer
                          # and the running product
        or $v0,$0,$s2     # Copy initial result to $v0 in preparation for the
                          # next instruction
        beqz $s4, OK      # Go to the exit if the argument is 0
                          # Call fact only if the argument is != 0
        jal fact          # Call to the factorial procedure
                          # (register 31 holds the return address)
                          # Note use of a recursive procedure call!
or $v1,$0,$v0 # Save result in $v1 so we can use $v0 for syscalls bgtz $v0, OK # Go around the error handling code if # result is OK err: ori $v0,$0,4 # Print error message la $a0,string2 syscall blez $v1,ttfn # Go to exit code OK: sw $v0,result # Save the result lw $a0,num # Put num in reg $v0 ori $v0,$0,1 # Print value of num syscall ori $v0,$0,4 # Print " factorial is " la $a0,string1 syscall or $a0,$0,$v1 # Copy result to reg $a0 ori $v0,$0,1 # Print value of (num)! syscall ttfn: ori $v0,$0,10 syscall # TTFN fact: addi $v0,$0,-1 # Entry point for factorial routine # A negative result indicates an error; # $v0 will stay negative only if it is not # overwritten by the instruction before goback bltz $s4, goback # Test for invalid (negative) counter sw $ra,0($sp) # Push return address onto stack add $sp,$sp,-4 # Register 29 points to the first free # location if: mul $s2,$s2,$s4 add $s3,$s4,-1 # Subtract 1 from the counter blez $s3,endif # and go to endif if <= 0 move $s4,$s3 # Update the counter jal fact # Call fact again endif: add $sp,$sp,4 # Pop a return address from the stack lw $ra,0($sp) # This statement loads the return address # that was stored just above the stack # pointer into register 31, in preparation # for the return or $v0,$0,$s2 # Put result in register v0 goback: jr $ra # Go back one step in the recursion, # or return to the main routine if this is # the return from the first call to fact ################################################################################ # Example of stack usage and procedure calls from Goodman and Miller, # p. 231 (bugs fixed and I/O added) # #Thisprogramcomputesthevalueofbaseraisedtopower,storesthe #result,andprintstheanswerontheconsole. # # The only tricky part is the use of a recursive procedure, that is, # a procedure that calls itself. To understand what is happening # in this program, you MUST single-step through it. Look at the # registers and the stack as you step.

        .data
exp:     .word 0                # The exponent
base:    .word 0                # The base
result:  .word 1                # Will hold base^exponent
string1: .asciiz " to the power "
string2: .asciiz " is "
newl:    .asciiz "\n"
string3: .asciiz "Input a base "
string4: .asciiz "Input an exp "
        .text
main:   li   $v0,4
        la   $a0,string3        # puts string3
        syscall
        li   $v0,5
        syscall
        sw   $v0,base
        li   $v0,4
        la   $a0,newl           # prints a newline
        syscall
        li   $v0,4
        la   $a0,string4        # puts string4
        syscall
        li   $v0,5
        syscall
        sw   $v0,exp
        li   $v0,4
        la   $a0,newl           # prints a newline
        syscall

        lw   $s0,base           # Register 16 holds the base
        li   $s2,1              # Register 18 will hold the answer
                                # and the running product
        lw   $s4,exp            # Register 20 is a counter
        jal  power              # Call to the power procedure
                                # (register 31 holds the return address)
                                # Note use of a recursive procedure call!

        sw   $s2,result         # Save the result
        li   $v0,1              # Output the result: print base
        lw   $a0,base
        syscall
        li   $v0,4
        la   $a0,string1        # puts " to the power "
        syscall
        li   $v0,1
        lw   $a0,exp            # print exp
        syscall
        li   $v0,4
        la   $a0,string2        # puts " is "
        syscall
        li   $v0,1
        lw   $a0,result         # print result
        syscall
        li   $v0,4
        la   $a0,newl           # prints a newline
        syscall
        li   $v0,10
        syscall                 # done -- TTFN
power:  sw   $ra,0($sp)         # Entry point for power routine
                                # Push return address onto stack
        add  $sp,$sp,-4         # Register 29 points to the first free
                                # location
if:     add  $s3,$s4,-1         # Subtract 1 from the counter
        blez $s3,endif          # and go to endif if <= 0
        move $s4,$s3            # Update the counter
        jal  power              # Call power again
endif:  mul  $s2,$s2,$s0        # Get here only if counter <= 0
                                # Put base X running product in Reg. 18
        add  $sp,$sp,4          # Pop a return address from the stack
        lw   $ra,0($sp)
goback: jr   $ra                # and go back one step in the recursion
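Unlike fact, the power routine above multiplies on the way back out of the recursion: it recurses until the counter is exhausted, then performs one multiplication per returning stack frame. A Python sketch of that structure (illustrative only; like the MIPS code, it assumes exp >= 1):

```python
def power(base, exp):
    """Recursive power in the style of the MIPS power routine (exp >= 1).

    The recursion bottoms out when the counter reaches 1; each frame
    then contributes one factor of base as the calls return.
    """
    if exp - 1 <= 0:
        return base                       # deepest frame: first factor
    return power(base, exp - 1) * base    # one multiply per returning frame
```

For example, power(2, 10) evaluates to 1024.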

OPERATING SYSTEMS (1)

• An operating system (OS) is a program that performs system startup and controls the execution of all other programs
 Provides a higher level of abstraction than the instruction set architecture
 User programs don’t have to know any details of the hardware or the instruction set
 The kernel is the part of an OS that:
   ◦ Runs at the highest privilege level
      The kernel’s private address space (≥ 0x80000000 in MIPS R2000) is processor-locked against access by less privileged programs
   ◦ Mediates access by all user programs to the hardware
      CPU, memory, I/O devices
   ◦ Mediates access by all user programs to basic software routines
      File operations, communication with other processes, network


EXAMPLES OF OPERATING SYSTEMS (1)

• MVS (aka OS/390 or z/OS)
 Used on IBM mainframe computers
 Most stable and secure of all OS’s
• UNIX
 Used on many servers and special-purpose desktops (IC design, etc.)
 Highly stable and secure
 Originated at AT&T
 Many variants
   ◦ Open-source: FreeBSD, Mac OS X (Darwin kernel only)
   ◦ Proprietary: Solaris (Sun Microsystems), HP-UX (Hewlett-Packard)
 Open-API (application programming interface) standard: POSIX


EXAMPLES OF OPERATING SYSTEMS (2)

• Linux
 Kernel created by a student, Linus Torvalds, to provide a free, open-source, UNIX-like OS
 Supported by thousands of programmers world-wide
 Favored by industry for its customizability
• Microsoft Windows
 Proprietary
 Most widely-installed OS
 Some applications created for Windows are default standards (Microsoft Office suite)
 Widely criticized by many IT professionals for bugs, security problems, and instability


OPERATING SYSTEMS (2)

• A microkernel is a kernel that provides only the most essential services
 Memory management, interprocess communication, basic scheduling
 Examples: vmunix, vmlinux, System, kernel32.dll
• Programs issue system calls to request execution of a kernel routine (file access, I/O, interprocess communication, network protocols)
 The kernel’s call handler uses information passed as arguments in the system call to decide what action to take
 After servicing a system call, the kernel can resume execution of, suspend, or terminate the process that issued the system call
 The kernel can schedule other tasks while a process awaits the return from a system call (if the system call returns to the process)
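The call-handler behavior described above — use the system-call number and its arguments to decide what action to take — amounts to a dispatch table. A toy Python sketch (all names here are invented for illustration; the numbers loosely follow SPIM’s convention of selecting the service through $v0):

```python
def sys_print_int(state, value):
    """Toy service: 'print' an integer to the console."""
    state["console"].append(str(value))

def sys_exit(state):
    """Toy service: terminate the calling process."""
    state["running"] = False

SYSCALL_TABLE = {1: sys_print_int, 10: sys_exit}

def handle_syscall(state, number, *args):
    """Kernel call handler: dispatch on the system-call number."""
    try:
        service = SYSCALL_TABLE[number]
    except KeyError:
        raise ValueError("unknown system call %d" % number)
    service(state, *args)

state = {"console": [], "running": True}
handle_syscall(state, 1, 42)    # like SPIM's print_int
handle_syscall(state, 10)       # like SPIM's exit
```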


OPERATING SYSTEMS (3)

• Services provided by modern operating systems:
 Memory management (virtual memory)
 Priority-controlled execution of applications and utility programs
   ◦ The OS calls other programs as if they were procedures, and allows them to run until their time is up or an exception occurs
 Interprocess communication
 Access to I/O devices (screen, mouse, keyboard, printer, ...)
   ◦ A driver is a routine that issues commands to an I/O device
 Access to files
   ◦ Actions include open, close, rewind, seek
   ◦ Permissions (stored with each file) control access
 Detection and handling of events that interrupt program execution
   ◦ Hardware interrupts, processor exceptions, protection faults
 Accounting

REFERENCES ON OPERATING SYSTEMS

Nothing can take the place of a good course, but these books may help:
1. Operating Systems: Internals and Design Principles, 3rd Edition, by William Stallings (Prentice Hall, 1998; ISBN 0-13-887407-7).
2. Operating Systems: Design and Implementation, 2nd Edition, by Andrew S. Tanenbaum (Prentice Hall, 1997; ISBN 0-13-638677-6).
3. The Design and Implementation of the 4.4BSD Operating System, by Marshall Kirk McKusick, Keith Bostic, Michael J. Karels, and John S. Quarterman (Addison-Wesley, 1996; ISBN 0-201-54979-4).
4. The Design of the UNIX Operating System, by Maurice Bach (Prentice Hall, 1987; ISBN 0-13-201799-7).


HOW A COMPUTER RUNS PROGRAMS (1)

• Single-tasking operating system (PC-DOS, 1950’s mainframe systems)
 Only one program can run at a time
 User programs can call OS functions directly (BIOS routines in PC)
 User programs can issue commands directly to the hardware
   ◦ Affords easy entry for destructive viruses (which can execute FORMAT C:, for example)
 Division between OS memory segment and user memory segment is voluntary (not enforced by hardware)
   ◦ ⇒ User programs can clobber the OS memory segment


HOW A COMPUTER RUNS PROGRAMS (2)

• Multi-tasking operating system (UNIX, MVS, Windows NT, MacOS)
 Many programs can run simultaneously (though only one at a time can change the program counter)
 Each user program “sees” a virtual machine
   ◦ Address space in memory
   ◦ Registers
   ◦ Program counter
 The operating system has a memory-resident kernel, which provides low-level I/O, hardware access and security
   ◦ In the MIPS R2000 ISA, kernel addresses have the most significant bit set (kernel addresses are ≥ 0x8000 0000)
   ◦ Kernel functions and the kernel’s memory space can be accessed only in a special CPU mode (kernel mode)
 User programs obtain services from the OS through system calls

MIPS USER VIRTUAL ADDRESS SPACE

0x7fffffff   $sp →   Stack Segment
                     (grows downward)

                     Dynamic Data Segment
                     (grows upward)
0x10008000   $gp →
0x10000000           Static Data Segment

                     Text Segment
0x00400000   $pc →

                     Reserved


PROCESSES (1)

• A program is an executable file
 All C programs that run under Solaris have the interface main(int argc, char *argv[], char *envp[])
• A process is an instance of a program in execution
 In the UNIX OS, a process is created by a system call (fork)
 In a multitasking OS, one user can execute many processes simultaneously
 Many instances of the same program can execute simultaneously (e.g., UNIX shell commands such as cd, cp, etc.)
 A process terminates:
   ◦ When it reaches the end of its main() function
   ◦ When the kernel kills it because of timeout, an exception, termination of its parent process, or user action
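The fork system call mentioned above is exposed in Python as os.fork on POSIX systems; a minimal sketch of creating a child process (a second instance of the same program) and collecting its exit status:

```python
import os

pid = os.fork()                    # clone the current process
if pid == 0:
    # Child: fork returned 0 here; this is a separate process
    os._exit(7)                    # terminate with a status for the parent
else:
    # Parent: fork returned the child's process id; wait for the child
    _, status = os.waitpid(pid, 0)
    child_code = os.WEXITSTATUS(status)
```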


HOW A COMPUTER LAUNCHES A PROCESS

• All C programs that run under Solaris have the interface main(int argc, char *argv[], char *envp[])
• To launch a user process, the kernel
 Creates an address space for the new process
 Copies the executable text module into the address space
 Pushes the command-line arguments and the environment arguments onto the stack
 Calls main as a procedure


ARGUMENTS AND ENVIRONMENTAL VARIABLES

• C program prarg.c to obtain values of arguments:

#include <stdio.h>
int main(int argc, char *argv[], char *envp[])
{
    int index;
    for (index = 0; argv[index] != NULL; index++)
        printf("argv[%d] = %s\n", index, argv[index]);
    return 0;
}

• C program prenv.c to obtain values of environmental variables:

#include <stdio.h>
int main(int argc, char *argv[], char *envp[])
{
    int index;
    for (index = 0; envp[index] != NULL; index++)
        printf("envp[%d] = %s\n", index, envp[index]);
    return 0;
}

VALUES OF ARGUMENTS

thor% cc prarg.c
thor% mv a.out prarg
thor% prarg arg1 arg2 arg3 arg4
argv[0] = prarg
argv[1] = arg1
argv[2] = arg2
argv[3] = arg3
argv[4] = arg4


VALUES OF ENVIRONMENTAL VARIABLES

thor% cc prenv.c
thor% mv a.out prenv
thor% prenv
envp[0] = HOME=/home/thor/cantrell
envp[1] = PATH=/bin:/home/thor/cantrell/bin:/usr/bin:
envp[2] = LOGNAME=cantrell
envp[3] = HZ=100
envp[4] = TERM=xterm
envp[5] = TZ=US/Central
envp[6] = SHELL=/bin/csh
envp[7] = MAIL=/var/mail/cantrell
envp[8] = PWD=/home/thor/cantrell/codes/c.progs
envp[9] = USER=cantrell
envp[10] = HOST=thor
...


XSPIM STARTUP CODE BEFORE LOADING A PROGRAM

[0x00400000] 0x8fa40000  lw $4, 0($29)        # $a0 points to argc (top of stack)
[0x00400004] 0x27a50004  addiu $5, $29, 4     # $a1 points to argv (next)
[0x00400008] 0x24a60004  addiu $6, $5, 4      # $a2 would point to envp if there
                                              #  were no command-line arguments
[0x0040000c] 0x00041080  sll $2, $4, 2        # multiply argc by 4 (gives number
                                              #  of bytes needed to hold pointers
                                              #  to arguments)
[0x00400010] 0x00c23021  addu $6, $6, $2      # $a2 now points to envp, with
                                              #  space needed for arguments
                                              #  taken into account
[0x00400014] 0x0c000000  jal 0x00000000 [main] # main called as a procedure
[0x00400018] 0x3402000a  ori $2, $0, 10       # $v0 holds number of system call
                                              #  (exit routine, in this case)
[0x0040001c] 0x0000000c  syscall              # system call


XSPIM STARTUP CODE AFTER LOADING A PROGRAM

[0x00400000] 0x8fa40000  lw $4, 0($29)        # $a0 points to argc (top of stack)
[0x00400004] 0x27a50004  addiu $5, $29, 4     # $a1 points to argv (next)
[0x00400008] 0x24a60004  addiu $6, $5, 4      # $a2 would point to envp if there
                                              #  were no command-line arguments
[0x0040000c] 0x00041080  sll $2, $4, 2        # multiply argc by 4 (gives number
                                              #  of bytes needed to hold pointers
                                              #  to arguments)
[0x00400010] 0x00c23021  addu $6, $6, $2      # $a2 now points to envp, with
                                              #  space needed for arguments
                                              #  taken into account
[0x00400014] 0x0c100008  jal 0x00400020       # main called as a procedure
                                              #  (note that the loader has
                                              #  inserted the correct address)
[0x00400018] 0x3402000a  ori $2, $0, 10       # $v0 holds number of system call
                                              #  (exit routine, in this case)
[0x0040001c] 0x0000000c  syscall              # system call
[0x00400020] 0x3c071001  lui $7, 4097 [ar1]   # start of "main" ...


XSPIM STACK BEFORE jal main (1)

STACK
[0x7fffebe0] 0x00000001 0x7fffec7c 0x00000000 0x7fffefe4
[0x7fffebf0] 0x7fffef30 0x7fffef1f 0x7fffef18 0x7fffef0d
[0x7fffec00] 0x7fffeeff 0x7fffeef0 0x7fffeed8 0x7fffeec0
[0x7fffec10] 0x7fffeeb2 0x7fffeea8 0x7fffee99 0x7fffee8b
[0x7fffec20] 0x7fffee72 0x7fffee4b 0x7fffee22 0x7fffedb6
[0x7fffec30] 0x7fffed9b 0x7fffed7e 0x7fffed60 0x7fffed55
[0x7fffec40] 0x7fffed49 0x7fffed3a 0x7fffed2e 0x7fffed22
[0x7fffec50] 0x7fffed18 0x7fffed10 0x7fffed05 0x7fffecf5
[0x7fffec60] 0x7fffecdd 0x7fffecb9 0x7fffec9f 0x00000000
[0x7fffec70] 0x00000000 0x00000000 0x00000000 0x2f686f6d
[0x7fffec80] 0x652f7468 0x6f722f63 0x616e7472 0x656c6c2f
[0x7fffec90] 0x696e6974 0x31642d78 0x7370696d 0x2e730044
[0x7fffeca0] 0x4953504c 0x41593d31 0x32392e31 0x31302e38
...
[0x7fffefe0] 0x62696e00 0x484f4d45 0x3d2f686f 0x6d652f74
[0x7fffeff0] 0x686f722f 0x63616e74 0x72656c6c 0x00000000


XSPIM STACK BEFORE jal main (2)

STACK
[0x7fffebe0] 0x00000001 0x7fffec7c 0x00000000 0x7fffefe4
             argc = 1   pointer to terminating pointer to
                        argument   null word   1st env var
...
[0x7fffec70] 0x00000000 0x00000000 0x00000000 0x2f686f6d
                                              "/hom"
[0x7fffec80] 0x652f7468 0x6f722f63 0x616e7472 0x656c6c2f
             "e/th"     "or/c"     "antr"     "ell/"
[0x7fffec90] 0x696e6974 0x31642d78 0x7370696d 0x2e730044
             "init"     "1d-x"     "spim"     ".s"
Recall that the first argument of every program (argv[0]) is the program’s file name
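The ASCII decoding in the dump above is easy to verify: each 32-bit word holds four characters, most significant byte first, as xspim displays them. A quick Python check (illustrative):

```python
def word_to_ascii(word):
    """Decode a 32-bit word, most significant byte first, as four ASCII chars."""
    return word.to_bytes(4, "big").decode("ascii")

# Words from the stack dump at 0x7fffec70-0x7fffec9f (the file-name string)
words = [0x2f686f6d, 0x652f7468, 0x6f722f63, 0x616e7472,
         0x656c6c2f, 0x696e6974, 0x31642d78, 0x7370696d]
text = "".join(word_to_ascii(w) for w in words)
```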


PROCESSES (2)

• When the UNIX kernel creates a process, it creates a process structure that contains:
 Process number, user & group IDs, privileges
 Processor state information
   ◦ User-visible registers, program counter, status register, etc.
   ◦ Stack pointers to system stack(s) employed by the process for procedure calls
 Process control information
   ◦ Scheduling information (process ready for execution, blocked, waiting for I/O, waiting for a specific event, etc.)
   ◦ Files opened or used by the process
   ◦ Data structuring (pointers to other processes in a queue, etc.)
   ◦ Interprocess communication


INSPECTION OF RUNNING PROCESSES USING ps

apache% ps -ef | more
     UID   PID  PPID  C    STIME TTY      TIME CMD
    root     0     0  0   Apr 20 ?        0:15 sched
    root     1     0  0   Apr 20 ?        2:05 /etc/init -
    root     2     0  0   Apr 20 ?        0:22 pageout
    root     3     0  1   Apr 20 ?       83:17 fsflush
     ggb 27691 27676  0 07:58:22 pts/21   0:01 telnet alanthia.gator.net 1536
    root 26125     1  0   Apr 25 ?        0:37 /usr/sbin/syslogd
    root   110     1  0   Apr 20 ?        0:45 /usr/sbin/rpcbind
    root   518     1  0   Apr 20 ?        0:00 /usr/lib/saf/sac -t 300
    root   140     1  0   Apr 20 ?        0:57 /usr/sbin/inetd -s -t
     dkw 22751 20173  0   Apr 25 pts/51   0:00 vi project.pl
    root   118     1  0   Apr 20 ?        0:02 /usr/sbin/nis_cachemgr
    root   147     1  0   Apr 20 ?        0:01 /usr/lib/nfs/lockd
    root   145     1  0   Apr 20 ?        0:06 /usr/lib/nfs/statd
    root   195     1  0   Apr 20 ?       21:47 /usr/lib/autofs/automountd
    liux 29777 29771  0 08:48:40 pts/65   0:00 telnet titan
    root 19238   140  0 00:54:19 ?        0:00 in.telnetd
   nigel 29379 29377  0   Apr 21 pts/59   0:00 -bash

PROCESSES (3)

• When the UNIX kernel creates a process, it allocates a standard process address space that includes:
 User data segment(s)
 User text segment(s)
 System stack(s) to support procedure calls
 An initially unallocated protected region from which the process can allocate space using malloc()


INSPECTION OF PROCESS ADDRESS SPACE

thor% /usr/proc/bin/pmap 373
373: -csh -c unsetenv _ PWD; unsetenv DT; setenv DISPLAY :
00010000   144K read/exec         /usr/bin/csh
00042000    24K read/write/exec   /usr/bin/csh
00048000   192K read/write/exec   [ heap ]
EF680000   592K read/exec         /usr/lib/libc.so.1
EF722000    32K read/write/exec   /usr/lib/libc.so.1
EF72A000     8K read/write/exec   [ anon ]
EF770000    16K read/exec         /usr/platform/sun4u/lib/libc_psr.so.1
EF790000     8K read/exec         /usr/lib/libmapmalloc.so.1
EF7A0000     8K read/write/exec   /usr/lib/libmapmalloc.so.1
EF7B0000     8K read/exec         /usr/lib/libdl.so.1
EF7C0000     8K read/write/exec   [ anon ]
EF7D0000   112K read/exec         /usr/lib/ld.so.1
EF7FA000     8K read/write/exec   /usr/lib/ld.so.1
EFFF6000    40K read/write/exec   [ stack ]
  total   1200K


EXCEPTIONS (1)

• An exception is an event that causes a process to stop executing
 System call (explicit instruction, e.g. for I/O)
   ◦ Stopping execution permits another process to execute while the process that made the syscall waits for I/O
 Interrupt caused by an event external to the current instruction
   ◦ Example: Hardware controller signals end of I/O
 Exception associated with execution of the current instruction
   ◦ Bus error (I/O timeout, load/store kernel physical address)
   ◦ Protection exception
   ◦ Attempt to execute a reserved instruction
   ◦ Cache/TLB miss
   ◦ Floating-point arithmetic exception


EXCEPTIONS (2)

• In the R2000 ISA, exceptions are handled by coprocessor 0
• How the R2000 processor and the UNIX kernel perform exception handling:
1. Processor exits user mode & is forced into kernel mode.
2. The address of an exception vector (exception handling program) is loaded into the program counter (PC).
 Reset exception (reboot): the processor transfers control to the Reset exception vector at address 0xbfc00000
 UTLB Miss: Control is transferred to the exception vector pointed to by the contents of address 0x80000000
 All other exceptions are handled by the kernel
   ◦ The general exception handler pointed to by the contents of address 0x80000080 takes control, gets the cause from the Cause register, and transfers control to the correct exception handler

MIPS CP0 and Exception Handling Registers

(Figure: the CP0 register set)
 Used with virtual memory: the TLB (Translation Lookaside Buffer, including its “safe” entries) and the EntryHi, EntryLo, Index, Random, and Context registers
 Used for exception processing: the Status, Cause, EPC, BadVAddr, and PRId registers

HOW A COMPUTER BOOTS UP

• Before any processes can execute, the operating system kernel must be loaded into memory
1. When the processor recognizes the Reset exception, the Reset exception vector loads a short program called the bootstrap loader from ROM, EPROM or NVRAM into memory and transfers control to it
 In ancient times, the binary machine instructions for the bootstrap loader were toggled in from the system console, or read from cards or paper tape
2. The bootstrap loader reads another program from a fixed location on disk (called the boot block) and transfers control to it
3. The program read in by the bootstrap loader reads the kernel from disk and transfers control to it


PROCESSES (4)

• In making a process switch, the UNIX kernel:
 Pushes the register context and system stack onto a context stack maintained by the kernel
 Uses a scheduling algorithm to find the best process to execute next
 Retrieves (pops) the context of the next process from the context stack
 Restores the context (registers, stack, mode)
 Loads the entry address of the next process into the program counter
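The save/schedule/restore sequence above can be sketched in Python (purely illustrative — a real kernel does this in assembly with hardware support, and the “scheduler” here is a trivial stand-in):

```python
saved_contexts = {}    # kernel-maintained contexts, keyed by process id

def context_switch(current_pid, current_context, ready_pids):
    """Save the current context, pick the next process, restore its context."""
    saved_contexts[current_pid] = current_context   # save (push) the context
    next_pid = min(ready_pids)                      # stand-in scheduling policy
    next_context = saved_contexts.pop(next_pid)     # retrieve the next context
    return next_pid, next_context                   # restore and resume

saved_contexts[2] = {"pc": 0x00400020, "regs": [0] * 32}
pid, ctx = context_switch(1, {"pc": 0x00400100, "regs": [0] * 32}, [2])
```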


PROCESSOR DESIGN (1)

• Major steps in designing a processor:
 Datapath design
   ◦ Instructions
      Instruction memory
      Instruction fetch
      Program counter and adder
   ◦ Data
      Data memory
      Register file
      ALU
 Control design
   ◦ Control signal specification
   ◦ Implementation
      Hardwired
      Microprogrammed

PROCESSOR DATAPATH

• At the end of every clock cycle, data needed in later cycles must be stored in a state element
 Data needed by subsequent instructions is stored in general-purpose registers (GPRs)
   ◦ GPRs are the only state elements needed if all instructions execute in one clock period
 Multi-cycle implementation:
   ◦ Instructions execute in more than one clock cycle
   ◦ Data needed by one instruction in subsequent clock cycles of its execution must be stored in special-purpose registers (e.g., the memory address register)


BUILDING BLOCKS OF THE INSTRUCTION DATAPATH

(Figure: datapath building blocks)
a. Instruction memory (32-bit instruction address in, 32-bit instruction out)
b. Program counter (PC)
c. Adder (two 32-bit inputs, 32-bit Sum output)

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

PART OF THE INSTRUCTION DATAPATH

(Figure: the PC supplies the read address to the instruction memory, which outputs the 32-bit instruction; an adder computes PC + 4 for the next fetch)

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

MODULES NEEDED FOR R-FORMAT INSTRUCTIONS

(Figure: modules needed for R-format instructions)
a. Register file: two 5-bit Read register inputs, a 5-bit Write register input, 32-bit Write data, two 32-bit Read data outputs, RegWrite control
b. ALU: 3-bit ALU control input, two 32-bit data inputs, 32-bit ALU result, Zero output

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

ALU CONTROL SIGNALS (1)

• Three signals are necessary for the simple ALU of P & H, Chapter 4
 Bnegate: Asserts Binvert and CarryIn
 Operation: Selects the output signal
   ◦ 00 for and, 01 for or, 11 for slt, 10 for add or sub

ALU Control Signals

  Bnegate   Operation (b1 b0)   MIPS Instruction
     0          0  0            and
     0          0  1            or
     0          1  0            add
     1          1  0            sub
     1          1  1            slt


1-BIT ALU SCHEMATIC DIAGRAM


ALU CONTROL SIGNALS (2)

• Functions of the ALU control signals for 1 bit out of 32 or 64:

 The value of b1b0 determines which device’s output is selected
   ◦ b1b0 = 00 selects the AND output
   ◦ b1b0 = 01 selects the OR output
   ◦ b1b0 = 10 selects the SUM output of the adder
   ◦ b1b0 = 11 selects the Less input
      Less is asserted only in the LSB, and only if the MSB is 1
 Bnegate selects between addition and two’s complement subtraction
   ◦ If Bnegate is deasserted, the output of the adder is a + b
   ◦ If Bnegate is asserted, the adder computes a + ¯b + 1:
      The lower 2–1 MUX selects ¯b
      The upper 2–1 MUX sends the (asserted) Bnegate signal to the carry-in input of the adder (this is the +1)
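A word-level model of this ALU (illustrative Python; the Less path is simplified here to reading the sign of a − b, whereas the hardware feeds the MSB’s Set output back into bit 0’s Less input):

```python
MASK = 0xFFFFFFFF    # model 32-bit words

def alu(a, b, bnegate, op):
    """Model of the simple ALU: Bnegate plus the 2-bit Operation code b1b0."""
    if bnegate:
        b = ~b & MASK                # lower MUX selects b-bar
    carry_in = 1 if bnegate else 0   # upper MUX sends Bnegate to carry-in
    total = (a + b + carry_in) & MASK
    if op == 0b00:
        return a & b                 # AND output
    if op == 0b01:
        return a | b                 # OR output
    if op == 0b10:
        return total                 # SUM: a + b, or a + b-bar + 1 = a - b
    return total >> 31               # slt (simplified): sign bit of a - b
```

For example, alu(4, 0xFFFFFFF3, 0, 0b10) reproduces the 4 + (−13) addition used in a later slide.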


DATAPATH FOR R-FORMAT INSTRUCTIONS

(Figure: instruction fields Op (6), Rs (5), Rt (5), Rd (5), Shamt (5), Funct (6); Rs and Rt select the two registers read, Rd selects the register written; the ALU, under the 3-bit ALU control, produces the Zero output and the 32-bit result written back through Write data; RegWrite enables the write)

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

R-FORMAT DATAPATH: EXAMPLE (1)

• Show the hexadecimal values of all datapath signals for the instruction add $5,$4,$3  The values in the registers read are:

◦ ($3) = 4 (decimal)
◦ ($4) = −13 (decimal)
• Results are shown in the following slide
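The expected result can be checked with 32-bit two’s complement arithmetic (illustrative Python):

```python
def to_u32(x):
    """Represent a signed value as a 32-bit two's complement word."""
    return x & 0xFFFFFFFF

read_data_1 = to_u32(-13)                        # ($4), the Rs register
read_data_2 = to_u32(4)                          # ($3), the Rt register
alu_result = to_u32(read_data_1 + read_data_2)   # -9 as a 32-bit word
```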


R-FORMAT DATAPATH: EXAMPLE (2)

(Figure: the R-format datapath with the example values filled in — Rs = 0x04, Rt = 0x03, Rd = 0x05; Read data 1 = 0xFFFFFFF3, Read data 2 = 0x00000004; ALU result = 0xFFFFFFF7, written to register 5)

MODULES NEEDED FOR LOADS AND STORES

(Figure: modules needed for loads and stores)
a. Data memory unit: 32-bit Address and Write data inputs, 32-bit Read data output, MemWrite and MemRead controls
b. Sign-extension unit: 16-bit input, 32-bit output

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

DATAPATH FOR LOADS

(Figure: I-format fields Op (6), Rs (5), Rt (5), Imm (16); stages: register access, address computation, memory read; the ALU adds Read data 1 and the sign-extended 16-bit immediate to form the address; the word read from data memory is written into register Rt; RegWrite and MemRead are asserted)

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

DATAPATH FOR LOADS: EXAMPLE (1)

• Show the hexadecimal values of all datapath signals for the instruction lw $4,-12($17)  The value in the register read is: ◦ ($17)=0x10010010  The value in the memory location pointed to by -12($17) is:

◦ ([0x10010004]) = −30 (decimal)
• Results are shown in the following slide
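The address computation can be checked directly: the 16-bit immediate 0xFFF4 (−12) is sign-extended and added to ($17). An illustrative Python check:

```python
def sign_extend_16(imm16):
    """Sign-extend a 16-bit immediate to a full (signed) value."""
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

offset = sign_extend_16(0xFFF4)                 # the -12 in lw $4,-12($17)
address = (0x10010010 + offset) & 0xFFFFFFFF    # ($17) plus the offset
```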


DATAPATH FOR LOADS: EXAMPLE (2)

(Figure: the load datapath with the example values filled in — Read register 1 = 0x11; Read data 1 = 0x10010010; sign-extended offset 0xFFF4 → 0xFFFFFFF4; ALU result (memory address) = 0x10010004; memory Read data = 0xFFFFFFE2, written to register 0x04)

DATAPATH FOR STORES

(Figure: same fields and address path as for loads — the ALU adds Read data 1 and the sign-extended 16-bit immediate to form the memory address; Read data 2, from register Rt, drives the data memory’s Write data input; MemWrite is asserted)

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

DATAPATH FOR CONDITIONAL BRANCHES

(Figure: branch datapath — the 16-bit offset is sign-extended, shifted left 2, and added to PC + 4 to form the branch target; the ALU subtracts the two registers read, and its Zero output goes to the branch control logic)

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

DATAPATH FOR A JUMP OR JUMP AND LINK

(Figure: J-format fields Op (6 bits) and Target (26 bits); the 26-bit target field is shifted left 2 and combined with the upper bits of the PC to form the jump target address)

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

COMBINED ALU/LOAD/STORE DATAPATH

(Figure: combined ALU/load/store datapath — the ALUSrc MUX selects Read data 2 or the sign-extended immediate as the second ALU input; the MemtoReg MUX selects the ALU result or the memory Read data as the value written back to the register file; control lines: ALU operation, RegWrite, ALUSrc, MemWrite, MemRead, MemtoReg)

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

COMBINED FETCH/ALU/LOAD/STORE DATAPATH

(Figure: instruction fetch added to the combined ALU/load/store datapath — the PC addresses the instruction memory, and an adder computes PC + 4 for the next fetch, every cycle)

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

FETCH/ALU/LOAD/STORE/BRANCH DATAPATH

(Figure: the full single-cycle datapath — the PCSrc MUX selects either PC + 4 or the branch target (PC + 4 plus the sign-extended offset shifted left 2) as the next PC; all other elements and control lines as on the preceding slides)

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

CONTROL UNIT DESIGN (1)

• Purpose of control unit: Use the Opcode and Function fields’ bits to set the levels of the control signals shown in orange on the preceding slide
• Overall approach: Multiple levels of decoding from Opcode and Function fields to control signals
 Common implementation technique
 Can reduce size of main control unit
 Several small control units may be faster than one large unit
   ◦ Control unit is often performance-critical
 Example of multiple-level approach:
   ◦ The main control unit generates a new 2-bit signal, ALUOp, from the Opcode field of the instruction (bits 31–26)
   ◦ The ALU control unit uses the ALUOp signal and the 6-bit Function field of the instruction (bits 5–0) to set the 3-bit ALU Operation signal

DATAPATH WITH CONTROL LINES IDENTIFIED

[Datapath diagram: the single-cycle datapath with control lines identified. Instruction fields [25–21], [20–16], and [15–11] select registers (RegDst choosing between [20–16] and [15–11]); bits [15–0] feed the sign extender; bits [5–0] feed the ALU control; control signals shown: PCSrc, RegWrite, ALUSrc, MemWrite, MemRead, MemtoReg, and ALUOp]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

CONTROL UNIT DESIGN (2)

ALUOp, Function, and ALU Control bits

Instruction  ALUOp  Instruction    Function  ALU       ALU Control
Opcode              Operation      Field     Action    Signal
lw           00     load word      dddddd    add       010
sw           00     store word     dddddd    add       010
beq          01     branch equal   dddddd    subtract  110
R-type       10     add            100000    add       010
R-type       10     subtract      100010    subtract  110
R-type       10     and            100100    and       000
R-type       10     or             100101    or        001
R-type       10     set on <       101010    set on <  111

• This is a condensed version of the full 256-row truth table
• ALUOp indexes the instruction type (R-type, load/store, branch)
• See a previous slide for the ALU Control Signal encodings
c C. D. Cantrell (02/1999)

CONTROL UNIT DESIGN (3)

Truth Table for ALU Control bits

ALUOp1  ALUOp0  F5 F4 F3 F2 F1 F0  ALU Control
  0       0      d  d  d  d  d  d     010
  d       1      d  d  d  d  d  d     110
  1       d      d  d  0  0  0  0     010
  1       d      d  d  0  0  1  0     110
  1       d      d  d  0  1  0  0     000
  1       d      d  d  0  1  0  1     001
  1       d      d  d  1  0  1  0     111

• The “don’t cares” (d) indicate signals that do not have to be used as inputs to the AND level of the ALU control block, which implements the 3 Boolean functions Operation(2–0) (the ALU Control signals (2–0))

c C. D. Cantrell (11/1999)
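The condensed truth table translates directly into code. A Python sketch (not from the slides; the function name is mine, and only the five R-type operations listed are decoded):

```python
def alu_control(aluop: int, funct: int) -> int:
    """3-bit ALU Operation signal from the 2-bit ALUOp and 6-bit Function field."""
    aluop1, aluop0 = (aluop >> 1) & 1, aluop & 1
    if aluop1 == 0 and aluop0 == 0:      # lw/sw: add
        return 0b010
    if aluop0 == 1:                      # beq: subtract (ALUOp1 is a don't care)
        return 0b110
    # ALUOp1 = 1: R-type. F5 and F4 are don't cares, so only F3-F0 are decoded.
    r_type = {
        0b0000: 0b010,  # add  (funct 100000)
        0b0010: 0b110,  # sub  (funct 100010)
        0b0100: 0b000,  # and  (funct 100100)
        0b0101: 0b001,  # or   (funct 100101)
        0b1010: 0b111,  # slt  (funct 101010)
    }
    return r_type[funct & 0b1111]
```

Note how the don't-care rows of the table become the two early returns: the Function field is never examined unless ALUOp1 is set.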

ALU CONTROL BLOCK

[Logic diagram: the ALU control block, with inputs ALUOp1, ALUOp0 and F(5–0) and outputs Operation2, Operation1, Operation0]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

SIMPLE DATAPATH WITH CONTROL UNIT

[Datapath diagram: the simple datapath with the main control unit added. The control unit decodes instruction bits [31–26] into RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite; the ALU control decodes bits [5–0] together with ALUOp into the 3-bit ALU control signal; PCSrc is asserted when Branch and Zero are both true]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

EFFECTS OF THE CONTROL SIGNALS

Signal name  Effect when deasserted              Effect when asserted
RegDst       Destination register number for     Destination register number for
             Write comes from rt (bits 20–16)    Write comes from rd (bits 15–11)
RegWrite     None                                Register on the Write register input
                                                 is written with the Write data value
ALUSrc       Second ALU operand comes from       Second ALU operand comes from
             Read data 2                         bits 15–0, sign-extended
PCSrc        PC → PC + 4                         PC → branch target
MemRead      None                                Data at Address is put on the
                                                 Read data output
MemWrite     None                                Data at Address is replaced by
                                                 the Write data value
MemtoReg     Value fed to Write data comes       Value fed to Write data comes
             from the ALU                        from data memory

c C. D. Cantrell (05/1999)

SETTINGS OF CONTROL LINES

Inst.   Reg  ALU  Mem-    Reg  Mem  Mem  Br   ALUOp  ALUOp
type    Dst  Src  to-reg  Wr   Rd   Wr        bit 1  bit 0
R-type   1    0     0      1    0    0    0     1      0
lw       0    1     1      1    1    0    0     0      0
sw       d    1     d      0    0    1    0     0      0
beq      d    0     d      0    0    0    1     0      1

c C. D. Cantrell (05/1999)
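The rows of the table can be captured as a small lookup structure. A Python sketch (the names are mine, and the don't-care entries are arbitrarily written as 0):

```python
from typing import NamedTuple

class MainControls(NamedTuple):
    reg_dst: int
    alu_src: int
    mem_to_reg: int
    reg_write: int
    mem_read: int
    mem_write: int
    branch: int
    alu_op: int

# One row per instruction class, as in the settings-of-control-lines table.
MAIN_CONTROL = {
    "R-type": MainControls(1, 0, 0, 1, 0, 0, 0, 0b10),
    "lw":     MainControls(0, 1, 1, 1, 1, 0, 0, 0b00),
    "sw":     MainControls(0, 1, 0, 0, 0, 1, 0, 0b00),
    "beq":    MainControls(0, 0, 0, 0, 0, 0, 1, 0b01),
}
```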

COMBINATIONAL LOGIC IMPLEMENTATION

[Logic diagram: PLA implementation of the main control unit. Inputs: Op5–Op0. First-level outputs: R-format, lw, sw, beq. Control outputs: RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0. The ALU control block then maps ALUOp1, ALUOp0 and F(5–0) to Operation2, Operation1, Operation0]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

PHASE 1 OF R-TYPE EXECUTION

[Datapath diagram, phase 1 of R-type execution: the instruction is fetched from instruction memory at the address in the PC, and the PC is incremented by 4]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

PHASE 2 OF R-TYPE EXECUTION

[Datapath diagram, phase 2 of R-type execution: the two source registers, selected by instruction bits [25–21] and [20–16], are read from the register file]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

PHASE 3 OF R-TYPE EXECUTION

[Datapath diagram, phase 3 of R-type execution: the ALU operates on the two register operands, using the ALU control derived from the Function field (bits [5–0])]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

PHASE 4 OF R-TYPE EXECUTION

[Datapath diagram, phase 4 of R-type execution: the ALU result is written back to the destination register selected by instruction bits [15–11]]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

OPERATION OF A LOAD INSTRUCTION

[Datapath diagram, operation of a load: the ALU adds the base register to the sign-extended offset (bits [15–0]); data memory is read at that address, and the value is written to the register selected by bits [20–16]]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

OPERATION OF A BRANCH INSTRUCTION

[Datapath diagram, operation of a branch: the ALU subtracts the two register operands to produce Zero, while the branch target (PC + 4 plus the sign-extended, left-shifted offset) is selected by PCSrc if the branch is taken]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

EXTENSIONS TO HANDLE JUMP INSTRUCTIONS

[Datapath diagram, extensions for jumps: instruction bits [25–0] are shifted left 2 and concatenated with PC + 4 [31–28] to form the 32-bit jump address; a Jump control signal selects it through an additional multiplexer]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

PERFORMANCE OF A SINGLE-CYCLE IMPLEMENTATION

• Our simple design implements control, ALU operations, loads/stores, and branches in one large combinational logic block
• Good news: the design assures that every instruction takes exactly one clock period
 CPI = 1
• Bad news: the clock period can be no shorter than the time required for the instruction with the longest delay
 The instruction with the most steps is likely to be the slowest, other things (such as the number of memory accesses) being equal
 The lw instruction uses five functional units in series (instruction memory, register file, ALU, data memory, and then the register file again)
 Example: assume that memory accesses and ALU operations take 2 ns each, while register accesses take 1 ns each
◦ With these assumptions, lw takes 8 ns

c C. D. Cantrell (05/1999)

Worst Case Timing (Load)

[Timing diagram: worst-case (load) timing. After the clock edge: PC clock-to-Q delay, instruction memory access (Rs, Rt, Rd, Op, Func valid), delay through the control logic (ALUctr, ExtOp, ALUSrc, MemtoReg, RegWr), register file access (busA, busB), delay through the extender and mux, ALU delay (Address), data memory access (busW), and finally the register write at the next clock edge]
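The 8 ns figure can be checked by summing unit delays along each instruction's serial path. A Python sketch under the slide's stated assumptions (the dictionary names and unit labels are mine):

```python
# Unit delays from the slide's assumptions: memory access and ALU take
# 2 ns each; a register file access takes 1 ns.
DELAY_NS = {"instruction memory": 2, "register file": 1, "ALU": 2, "data memory": 2}

# Functional units each instruction class uses in series.
PATHS = {
    "lw":     ["instruction memory", "register file", "ALU", "data memory", "register file"],
    "sw":     ["instruction memory", "register file", "ALU", "data memory"],
    "R-type": ["instruction memory", "register file", "ALU", "register file"],
    "beq":    ["instruction memory", "register file", "ALU"],
}

delays = {op: sum(DELAY_NS[u] for u in path) for op, path in PATHS.items()}
# lw has the longest path, so it sets the single-cycle clock period.
```

Under these assumptions lw comes out at 8 ns, so the single-cycle clock can be no faster than 8 ns even though beq needs only 5 ns.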

©UCB, DAP 97

Multicycle Implementation: Concept

• Divide the datapath into multiple clock cycles: instructions take from 3 to 5 cycles (IF: instruction fetch, RF: register fetch, EX: execution, MEM: memory access, WB: write-back)

[Block diagram: PC, instruction memory, register file, ALU, and data memory, with registers between the units]

John L. Hennessy

A MULTI-CYCLE IMPLEMENTATION (1)

• Each step in the execution of an instruction takes 1 clock period
 Different types of instructions can take different numbers of clock periods
◦ The clock period is no longer constrained by the longest execution time
 A functional unit can be used more than once per instruction if it is used in different clock periods
◦ Less hardware may be needed than for a single-cycle implementation
◦ Data and instruction memories can be combined into a single main memory
◦ We can get away with having a single ALU instead of an ALU and two adders

c C. D. Cantrell (05/1999)

STEPS IN EXECUTING AN INSTRUCTION

Step                  R-type              Memory reference        Branches            Jumps
Instruction fetch     IR = M[PC];  PC = PC + 4
Instruction decode,   A = Reg[IR[25–21]];  B = Reg[IR[20–16]];
register fetch        ALUOut = PC + (sign-extend(IR[15–0]) << 2)
Execution, address    ALUOut =            ALUOut = A +            If A == B then      PC = PC[31–28] ||
comp., branch/jump    A op B              sign-extend(IR[15–0])   PC = ALUOut         (IR[25–0] << 2)
completion
Memory access or      Reg[IR[15–11]]      Load: MDR = M[ALUOut]
R-type completion     = ALUOut            Store: M[ALUOut] = B
Memory read                               Load: Reg[IR[20–16]]
completion                                = MDR

c C. D. Cantrell (05/1999)
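The cycle counts implied by the step table can be tallied mechanically. A Python sketch (the step names paraphrase the table's rows; the structure itself is mine):

```python
# Steps per instruction class; each step takes one clock period.
STEPS = {
    "R-type": ("fetch", "decode/register fetch", "execution", "R-type completion"),
    "lw":     ("fetch", "decode/register fetch", "address computation",
               "memory access", "memory read completion"),
    "sw":     ("fetch", "decode/register fetch", "address computation", "memory access"),
    "beq":    ("fetch", "decode/register fetch", "branch completion"),
    "j":      ("fetch", "decode/register fetch", "jump completion"),
}

cycles = {op: len(steps) for op, steps in STEPS.items()}
```

So a load needs 5 cycles, a store and an R-type instruction 4, and branches and jumps only 3, which is the payoff of the multicycle design over a single long cycle.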

TIMING ISSUES IN MULTICYCLE IMPLEMENTATIONS

• Edge-triggered timing
 Instructions may take multiple clock periods to complete execution
 An instruction may need to write to different functional units in different clock periods
• Data used in a clock period must be stabilized in one of two ways:
 Driven from a register that was written in an earlier clock period
 Driven from a combinational logic block with register-driven inputs
◦ Example: suppose that the inputs to the ALU are stable
◦ The outputs are determined by the ALU's combinational logic
◦ Therefore the ALU outputs really do not need to be latched
◦ If a functional unit is used more than once per instruction, then its outputs must be latched so that they will not be overwritten
◦ Not latching a functional unit's outputs creates a multicycle delay path

c C. D. Cantrell (05/1999)

A Multiple Cycle Delay Path

• There is no register to save the results between:
 Register Fetch: busA ← Reg[rs]; busB ← Reg[rt]
 R-type Execution: ALU output ← busA op busB
 R-type Completion: Reg[rd] ← ALU output

[Datapath diagram: the multiple-cycle delay path. The register file outputs (busA, busB) feed the ALU through the ALUselA and ALUselB multiplexers; annotations ask whether registers should be added here to save the outputs of the register-fetch step, and here to save the output of the R-type execution step]

John L. Hennessy

A MULTI-CYCLE IMPLEMENTATION (2)

• A special-purpose register should be added after every major functional unit, in order to remember the output of that unit until the data has been used in a later clock period
 This is necessary for data that is needed by one instruction in a later step of its execution
◦ We call this temporary data
 Making data available to subsequent instructions is accomplished by saving the data in a general-purpose register

c C. D. Cantrell (11/1999)

REGISTERS FOR TEMPORARY DATA

[Block diagram: the multicycle datapath's temporary registers: the Instruction register and Memory data register after the shared memory, A and B after the register file read ports, and ALUOut after the ALU]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

MULTICYCLE DATAPATH (FIRST CUT)

[Datapath diagram: first cut at the multicycle datapath. A single memory holds instructions and data; the Instruction register and Memory data register latch its output; registers A and B latch the register file read ports; ALUOut latches the ALU result; multiplexers select the ALU inputs from PC, A, B, 4, and the sign-extended (and shifted) immediate]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

MULTICYCLE DATAPATH AND CONTROL (FIRST CUT)

[Datapath diagram: the first-cut multicycle datapath with its control lines identified: IorD, MemRead, MemWrite, IRWrite, RegDst, RegWrite, ALUSrcA, ALUSrcB, ALUOp, and MemtoReg]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

COMPLETED MULTICYCLE DATAPATH AND CONTROL

[Datapath diagram: the completed multicycle datapath and control. The control unit takes Op [5–0] and produces PCWrite, PCWriteCond, IorD, MemRead, MemWrite, MemtoReg, IRWrite, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, and RegDst; the jump address [31–0] is formed from instruction bits [25–0] shifted left 2, concatenated with PC [31–28]]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

METHODS FOR DESIGNING MULTICYCLE CONTROL

                          FSM path                       Microprogram path
Initial representation    Finite-state diagram           Microprogram
Sequencing control        Explicit next-state function   Microprogram counter + dispatch ROMs
Logic representation      Logic equations                Truth tables
Implementation technique  Programmable logic array       Read-only memory

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

MICROPROGRAMMING

• Used to implement complex finite-state-machine control
• Strategy:
 A set of values of the processor control signals = a microinstruction
 The microinstruction sequence is determined by:
◦ Inputs from the MIPS instruction's Opcode and Function fields
◦ The current state of the processor
• Hardware implementations:
 PLA for sequencing microinstructions and generating processor control signals
 ROM for storing both the microinstruction and a pointer to the next state

c C. D. Cantrell (07/1999)

Macroinstruction Interpretation

• A macroinstruction is implemented by a microinstruction sequence

[Diagram: each macroinstruction (ADD, SUB, AND, …) in the user program in main memory, which can change, is mapped into one of the microsequences in the CPU's control memory, e.g. Fetch, Calc Operand Addr, Fetch Operand(s), Calculate, Save Answer(s)]

FINITE STATE MACHINE CONTROL (OVERVIEW)

[State diagram overview: from Start, the instruction fetch/decode and register fetch states (Figure 5.37) dispatch to the memory access instructions (Figure 5.38), R-type instructions (Figure 5.39), branch instruction (Figure 5.40), or jump instruction (Figure 5.41)]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

INSTRUCTION FETCH/DECODE FSM

[State diagram: state 0, instruction fetch: MemRead, IorD = 0, IRWrite, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCWrite, PCSource = 00. State 1, instruction decode/register fetch: ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00. From state 1, dispatch on opcode: (Op = 'LW') or (Op = 'SW') to the memory reference FSM, (Op = R-type) to the R-type FSM, (Op = 'BEQ') to the branch FSM, (Op = 'JMP') to the jump FSM]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

MEMORY REFERENCE FSM

From state 1, on (Op = 'LW') or (Op = 'SW'):

State 2, memory address computation: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
State 3, memory access (Op = 'LW'): MemRead, IorD = 1
State 5, memory access (Op = 'SW'): MemWrite, IorD = 1; then to state 0
State 4, write-back step (after state 3): RegWrite, MemtoReg = 1, RegDst = 0; then to state 0

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

R-TYPE FSM

From state 1, on (Op = R-type):

State 6, execution: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
State 7, R-type completion: RegDst = 1, RegWrite, MemtoReg = 0; then to state 0

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

BRANCH FSM

From state 1, on (Op = 'BEQ'):

State 8, branch completion: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCWriteCond, PCSource = 01; then to state 0

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

JUMP FSM

From state 1, on (Op = 'J'):

State 9, jump completion: PCWrite, PCSource = 10; then to state 0

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

FSM FOR MULTICYCLE PROCESSOR

[State diagram: the complete 10-state FSM. State 0 (instruction fetch) leads to state 1 (instruction decode/register fetch), which dispatches to state 2 (memory address computation, for lw or sw), state 6 (execution, for R-type), state 8 (branch completion, for beq), or state 9 (jump completion, for j). State 2 leads to state 3 (lw memory access, followed by state 4, the write-back step) or to state 5 (sw memory access). States 4, 5, 7, 8, and 9 return to state 0. The control settings in each state are as on the preceding four slides]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition
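The dispatch structure of this FSM can be written as a next-state function. A Python sketch (not the slides' hardware; the state numbering follows the figures, and the function name is mine):

```python
def fsm_next_state(state: int, op: str) -> int:
    """Next-state function of the 10-state multicycle FSM."""
    if state == 0:                                   # instruction fetch
        return 1
    if state == 1:                                   # decode: dispatch on opcode
        return {"lw": 2, "sw": 2, "R-type": 6, "beq": 8, "j": 9}[op]
    if state == 2:                                   # memory address computation
        return 3 if op == "lw" else 5
    if state == 3:                                   # lw memory access
        return 4                                     # then the write-back step
    if state == 6:                                   # R-type execution
        return 7                                     # then R-type completion
    return 0                                         # states 4, 5, 7, 8, 9 finish
```

Iterating this function from state 0 reproduces the per-instruction cycle counts: lw passes through states 0, 1, 2, 3, 4 (five cycles), while beq needs only 0, 1, 8.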

IMPLEMENTATION USING A NEXT-STATE FUNCTION

Combinational control logic Datapath control outputs

Outputs

Inputs

Next state Inputs from instruction State register register opcode field

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

Implementation: Programmed Logic Arrays

• Each output = OR of ANDs (products) of the inputs and their complements
 Each product (AND) term, called a minterm, = one line in a truth table
 Example, one minterm: NS1 = ~S3 · ~S2 · ~S1 · S0 · Op5 · ~Op4 · ~Op2 · Op1 · Op0

[PLA diagram: inputs Op5–Op0 (opcodes: R = 000000, beq = 000100, lw = 100011, sw = 101011, ori = 001011, jmp = 000010) and state bits S3–S0 (states 0–11 encoded 0000–1011); outputs NS3–NS0]

Implementation Technique: PLA

• Complete next-state function

[PLA diagram: the complete next-state function, with inputs Op5–Op0 (lw = 100011, sw = 101011, R = 000000, ori = 001011, beq = 000100, jmp = 000010) and S3–S0 (states 0–11 encoded 0000–1011), and outputs NS3–NS0]

Implementing Control with a ROM

• Instead of a PLA, use a ROM with one word per state (a “control word”)
 Control word = datapath control bits + 2-bit sequencing control

State   Control word bits 18–2    Bits 1–0    Sequencing control values
0       10010100000001000         11          00 – next state ← 0
1       00000000010011000         01          01 – dispatch via ROM 1
2       00000000000010100         10          10 – dispatch via ROM 2
3       00110000000010100         11          11 – increment state
4       00110010000010110         00
5       00101000000010100         00
6       00000000001000100         11
7       00000000001000111         00
8       01000000100100100         00
9       10000001000000000         00
10      …                         11
11      …                         00
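The 2-bit sequencing field can be sketched in Python (names are mine; the dispatch ROM contents follow the dispatch ROM tables given a few slides later, with state numbers written in decimal):

```python
# Dispatch ROM contents: opcode name -> next state.
DISPATCH_ROM_1 = {"R-type": 6, "jmp": 9, "beq": 8, "ori": 10, "lw": 2, "sw": 2}
DISPATCH_ROM_2 = {"lw": 3, "sw": 5}

def rom_next_state(state: int, seq: int, op: str) -> int:
    """Interpret a control word's 2-bit sequencing field:
    00 -> state 0, 01 -> dispatch via ROM 1, 10 -> dispatch via ROM 2,
    11 -> increment the state."""
    if seq == 0b00:
        return 0
    if seq == 0b01:
        return DISPATCH_ROM_1[op]
    if seq == 0b10:
        return DISPATCH_ROM_2[op]
    return state + 1
```

Tracing a load through the ROM table above: state 0 (seq 11) increments to 1, state 1 (seq 01) dispatches via ROM 1 to state 2, state 2 (seq 10) dispatches via ROM 2 to state 3, and so on.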


IMPLEMENTATION USING A CONTROLLER

[Block diagram: microcode storage produces the datapath control outputs plus a sequencing field; the microprogram counter (incremented through an adder) and the address select logic, driven by the instruction register's opcode field, choose the next microinstruction address]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

Sequencer-based control unit

[Block diagram: the control logic drives the multicycle datapath controls; the state register, an adder, and the address select logic, with the opcode as input, implement three types of “branching”: set the state to 0, dispatch (states 1 and 2), or use the incremented state number]

Sequencer-based control unit details

Dispatch ROM 1 (used in state 1):
Op      Name    State
000000  Rtype   0110
000010  jmp     1001
000100  beq     1000
001011  ori     1010
100011  lw      0010
101011  sw      0010

Dispatch ROM 2 (used in state 2):
Op      Name    State
100011  lw      0011
101011  sw      0101

[Block diagram: the address select logic, with a mux choosing among dispatch ROM 2, dispatch ROM 1, the incremented state, and 0, driven by the opcode]

VAX Microinstructions

• VAX:
 96-bit microinstructions with 30 fields
 4096 microinstructions for the VAX ISA
 Encodes concurrently executable “micro-operations”

[Microinstruction format diagram: bits 95–87 USHF (shifter control: 001 = left, 010 = right, …, 101 = left3), bits 84–68 UALU (ALU control: 010 = A-B-1, 100 = A+B+1), bits 65–63 USUB (subroutine control: 00 = Nop, 01 = CALL, 10 = RTN), bits 11–0 UJMP (jump address)]

Microprogramming: One Inspiration for RISC

• If simple (micro) instructions can execute at a high clock rate…
• If you could write compilers to produce microinstructions…
• If programs use mostly simple instructions and addressing modes…
• If microcode is kept in RAM instead of ROM so as to fix bugs…
• If the same memory used for control memory could be used instead as a cache for “macroinstructions”…
• Then why not skip instruction interpretation by a microprogram and simply compile directly into the lowest language of the machine?
 Together with inspiration coming from ISA bloat, microprogramming helped drive the creation of ISAs that allowed simpler implementation, especially simpler control!

Microprogramming Pros and Cons

+ Flexibility
 Adapt to changes in organization, timing, technology
 Can make changes late in design or in the field
+ Can implement powerful instruction sets
 Historical perspective: microprogramming contributed to growth in ISA complexity and size
+ Generality
 Can implement multiple ISAs on the same machine
 Can tailor the instruction set to the application
+ Compatibility
 Many organizations, same instruction set
— Costly to implement
— Slow
Bottom line: a very limited role in implementing modern ISAs in modern technologies; a larger role for special-purpose machines.

Legacy Software & Microprogramming

• IBM bet the company on the 360 Instruction Set Architecture (ISA):
 A single instruction set for many classes of machines (8-bit to 64-bit)
• Stewart Tucker was stuck with the job of deciding what to do about software compatibility with earlier models
 Why not use multiple microprograms to do multiple instruction sets on the same microarchitecture?
 Coined the term “emulation”: an instruction-set interpreter in microcode for a non-native instruction set
 Very successful: in the early years of the IBM 360 it was hard to know whether the old instruction set or the new instruction set was more frequently used

MULTICYCLE DATA PATH WITH EXCEPTIONS

[Datapath diagram: the multicycle datapath extended for exceptions. New elements: the EPC register, written from the ALU under EPCWrite; the Cause register, written with IntCause under CauseWrite; a fourth PCSource multiplexer input holding the exception handler address (C0 00 00 00); and the additional control outputs CauseWrite, IntCause, and EPCWrite]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

STATES FOR EXCEPTION HANDLING

State 10: IntCause = 0, CauseWrite, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 01, EPCWrite, PCWrite, PCSource = 11
State 11: IntCause = 1, CauseWrite, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 01, EPCWrite, PCWrite, PCSource = 11
Both states then go to state 0 to begin the next instruction

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

MULTICYCLE FSM WITH EXCEPTION HANDLING

[State diagram: the multicycle FSM extended with exception handling. State 1 dispatches to state 10 on (Op = other), an undefined instruction, with IntCause = 0; arithmetic overflow during R-type execution leads to state 11, with IntCause = 1. Both exception states assert CauseWrite, EPCWrite, and PCWrite, with ALUSrcA = 0, ALUSrcB = 01, ALUOp = 01, PCSource = 11, and then return to state 0. The remaining states are as in the FSM without exceptions]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

PIPELINING: A CONTINUATION OF

• We found that the single-cycle implementation wastes time
 All instructions take as long as the instruction with the longest delay (lw)
• In the multicycle implementation:
 The clock period is much shorter than in the single-cycle implementation
 Instructions take only as many clock periods as they need
 BUT: each functional unit is used only once or twice in executing an instruction
◦ We need an implementation in which each functional unit is busy in every clock period
◦ This is possible if we cut the execution of an instruction into stages, and then overlap the execution of different instructions

c C. D. Cantrell (12/1999)

LAUNDRY: SEQUENTIAL vs. PIPELINED EXECUTION

[Timing diagrams: four laundry tasks A–D, from 6 PM to 2 AM. Executed sequentially, each task's steps finish before the next task starts; pipelined, the steps of successive tasks overlap, finishing in a fraction of the time]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

PIPELINING

• In a pipelined computer architecture, a single processor can execute several instructions concurrently
 Execution of one instruction uses several hardware functional units (instruction memory, register file, ALU, data memory, etc.)
 Most functional units are used once per instruction
◦ The register file may be read and written by the same instruction
 The hardware functional units are organized into stages
◦ Execution at each stage takes 1 clock period
◦ Stages are separated by clock-controlled pipeline registers that preserve the state of execution for the duration of a clock period
 The pipeline is subject to hazards
◦ Data hazards: write/read conflicts or timing problems
◦ Control hazards: exceptions and branches
• The MIPS R2000 pipeline design strongly influenced the design of all subsequent processors

c C. D. Cantrell (05/1999)

COMPUTER: SEQUENTIAL vs. PIPELINED EXECUTION

(Figure: three lw instructions executed on the single-cycle datapath, one every 8 ns, vs. pipelined execution, one starting every 2 ns; each instruction passes through instruction fetch, register read, ALU, data access, and register write)

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

PIPELINING (2)

• Pipeline speedup: a pipelined processor with s stages can execute n instructions in

      ET_P = s + (n − 1) clock periods (assuming no hazards)

  A serial processor executes the same n instructions in

      ET_S = ns clock periods

  The ideal pipeline speedup therefore equals the number of stages:

      S = ET_S / ET_P = ns / (s + (n − 1)) → s as n → ∞

• Amdahl’s law applies to pipelining
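The speedup formulas above can be checked numerically. Here is a minimal C sketch; the function names are illustrative, not from the textbook:

```c
/* Clock periods for n instructions on an s-stage pipeline, no hazards */
long etp(long s, long n) { return s + (n - 1); }

/* Clock periods for the same n instructions executed serially */
long ets(long s, long n) { return n * s; }

/* Speedup ET_S / ET_P; approaches s as n grows large */
double speedup(long s, long n) {
    return (double)ets(s, n) / (double)etp(s, n);
}
```

For s = 5 and n = 1000, ET_P = 1004 clock periods and the speedup is about 4.98, already close to the ideal value of 5.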

© C. D. Cantrell (11/1999)

PROGRAMMING IMPLICATIONS OF PIPELINING

• Avoid function or subprogram calls in an inner loop
  ◦ Jumps force the pipeline to be flushed
• Avoid recursion in an inner loop
  ◦ Recursion on the elements of an array generally causes data hazards, because the value of v[n] has not been written before it is needed for the computation of v[n+1]
• Avoid scalar temporary variables in an inner loop
  ◦ Reading a memory-resident scalar variable may cause a data hazard
• Avoid case and switch statements in an inner loop
  ◦ Conditional branches cause control hazards, and the use of a jump table may cause data hazards
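The recurrence in the second bullet can be made concrete. A minimal C sketch (the helper name recur_last is illustrative):

```c
/* Computes v[i] = 2*v[i-1] with v[0] = 1 and returns v[n-1].
   Each iteration reads the element written by the previous one,
   a loop-carried RAW dependence: v[i-1] must be written before
   the load that feeds the computation of v[i]. */
double recur_last(int n) {
    double v[64];
    v[0] = 1.0;
    for (int i = 1; i < n && i < 64; i++)
        v[i] = 2.0 * v[i - 1];
    return v[n - 1];
}
```

With n = 8 the result is 128 (2^7); the point is not the value but that no two iterations of the loop body can be overlapped freely.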

© C. D. Cantrell (05/1999)

MIPS PIPELINES (1)

• MIPS R2000 integer unit pipeline stages (Patterson & Hennessy, Chapter 6):
  1. Instruction Fetch (IF)
  2. Instruction Decode (ID) and Register Fetch
  3. Execute (EX or ALU)
     ◦ ALU operations, condition evaluation, address computation
  4. Memory access (MEM)
  5. Write back (WB) to register file

(Figure: the five stages IF, ID, EX, MEM, WB occupy clock periods 1–5)

© C. D. Cantrell (05/1999)

(Figure: the eight-stage pipeline of the R4000: IF, IS, RF, EX, DF, DS, TC, WB)

MIPS PIPELINES (2)

• MIPS R2000 floating-point unit pipeline stages:
  1. Instruction Fetch (IF)
  2. Register Fetch and Instruction Decode (RD)
     ◦ The FPU decodes the instruction on the bus to see if it’s floating-point
     ◦ The FPU reads data from its registers
  3. Execute (EX or ALU)
  4. Memory access (MEM)
  5. Exception processing (stage called WB for correspondence with the integer pipeline)
  6. Write back (FWB)

(Figure: the six stages IF, ID, EX, MEM, WB, FWB occupy clock periods 1–6)

© C. D. Cantrell (05/1999)

DATA HAZARDS (1)

• Data hazards occur when the order of read and write actions is not the order of strictly sequential execution
   Hazards are named by the ordering in the program that must be preserved in the course of pipelined execution
    ◦ In the following, the order of execution should be i, then j
   RAW (read after write) — j reads a source before i has written it
    ◦ j incorrectly gets the old value
    ◦ Most common kind of data hazard
   WAR (write after read) — j writes a destination before i reads it
    ◦ i incorrectly gets the new value
   WAW (write after write) — j should write an operand after i writes it, but the writes are performed in the wrong order, incorrectly leaving the value written by i
• Hazards limit pipeline speedup and complicate design

© C. D. Cantrell (05/1999)
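The RAW case can be illustrated with a toy model of one register: instruction i writes 20 into a register holding 10, and instruction j reads it. The function below (illustrative, not from the text) returns what j observes under each ordering:

```c
/* order 0: i executes before j (program order, correct result)
   order 1: j reads before i writes (RAW hazard, stale value) */
int raw_observed(int order) {
    int r = 10;        /* old register value */
    int seen;
    if (order == 0) { r = 20; seen = r; }   /* i then j */
    else            { seen = r; r = 20; }   /* j slips ahead of i */
    (void)r;           /* silence unused-store warning */
    return seen;
}
```

Under the hazard ordering, j sees the stale 10 instead of 20, which is exactly the error a pipeline must prevent.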

DATA HAZARDS (2)

• RAW hazards are generated by the instructions
      sub $2,$1,$3
      and $12,$2,$5
      or  $13,$6,$2
      add $14,$2,$2
      sw  $15,100($2)

(Figure: pipelined execution of the five instructions over clock periods 1–9; each passes through instruction fetch, register read, ALU, data access, and register write, and the later instructions read $2 while sub is still in flight)

© C. D. Cantrell (05/1999)

DATA HAZARDS (3)

• RAW hazards with i = sub $2,$1,$3 and source = $2 are generated by the instructions
      sub $2,$1,$3
      and $12,$2,$5
      or  $13,$6,$2
      add $14,$2,$2
      sw  $15,100($2)
   The instruction j = and $12,$2,$5 reads $2 before i writes it
   The instruction j = or $13,$6,$2 reads $2 before i writes it
   The instruction j = add $14,$2,$2 reads $2 in the same clock period in which i writes it
    ◦ This generates a hazard if the register file’s outputs change only on the edge of the main processor clock

© C. D. Cantrell (05/1999)

DEPENDENCE ANALYSIS FOR PIPELINED LOOPS (1)

• Follows ideas of K. Kennedy et al.
• Definition: if control flow within a program can reach statement T after passing through statement S, then T depends on S
   Dependence is always defined by reference to the results of serial execution
• Assume a loop
   Dependence analysis outside a loop is trivial
• Assume an array
   Dependences that affect pipelining arise from references to the same memory location M[n] in an array

© C. D. Cantrell (02/1999)

DEPENDENCE ANALYSIS FOR PIPELINED LOOPS (2)

• Distinguish between statements (in a program) and instances of those statements (in loop instances or threads)
   Let i = loop induction variable
   Example: a for loop in C
    ◦ Syntax: for ( i=0; i<N; i++ )

© C. D. Cantrell (02/1999)

DEPENDENCE ANALYSIS FOR PIPELINED LOOPS (3)

• Anti-dependence (WAR): S reads M[n] and T writes M[n]
      S (instance i_S):   ... = F[X[f_S[i]]]
      T (instance i_T):   X[f_T[i]] = ...

• Output dependence (WAW): S and T both write M[n]
      S (instance i_S):   X[f_S[i]] = ...
      T (instance i_T):   X[f_T[i]] = ...

© C. D. Cantrell (02/1999)

DEPENDENCE ANALYSIS FOR PIPELINED LOOPS (4)

• Requirement for the existence of an instance of dependence: a real memory location M[n] exists such that

      M[n] = f_S[i_S] = f_T[i_T]

  where S is executed before T and both values of the induction variable are in the range of the loop:

      p ≤ i_S ≤ i_T ≤ q

• Example: f_S and f_T are linear functions

      f_S[i] = a_S i + b_S,   f_T[i] = a_T i + b_T

  The requirement for dependence implies that

      a_S i_S + b_S = a_T i_T + b_T  ⇒  a_S i_S − a_T i_T + (b_S − b_T) = 0
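For small integer loop bounds, the dependence condition for linear subscripts can simply be tested by brute force. A minimal C sketch (the name has_dependence is illustrative):

```c
/* Returns 1 if some pair (iS, iT) with p <= iS <= iT <= q satisfies
   aS*iS + bS == aT*iT + bT, i.e. an instance of dependence exists */
int has_dependence(int aS, int bS, int aT, int bT, int p, int q) {
    for (int iS = p; iS <= q; iS++)
        for (int iT = iS; iT <= q; iT++)
            if (aS * iS + bS == aT * iT + bT)
                return 1;
    return 0;
}
```

For f_S(i) = i and f_T(i) = i − 1 over the range 2..100, the test finds a dependence; subscripts 2i and 2i + 1 never collide, so no dependence is reported.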

© C. D. Cantrell (02/1999)

DEPENDENCE ANALYSIS FOR PIPELINED LOOPS (5)

• Example:
      do 100 i=2,100
      T:  b(i)=a(i-1)
      S:  a(i)=c(i)
   Here, f_S(i) = i and f_T(i) = i − 1
   The condition f_S(i_S) = f_T(i_T) is i_S = i_T − 1 < i_T
   The equation i_S = i_T − 1 has many solutions with 2 ≤ i_S < i_T ≤ 100

© C. D. Cantrell (02/1999)

DEPENDENCE ANALYSIS FOR PIPELINED LOOPS (6)

• In serial execution, T reads from a(2) after S writes to a(2):
      i = 2:   T2: b(2)=a(1)
               S2: a(2)=c(2)
      i = 3:   T3: b(3)=a(2)
               S3: a(3)=c(3)
• In vector execution, T reads from a(2) before S writes to it:
      b(2)=a(1)   b(3)=a(2)
      a(2)=c(2)   a(3)=c(3)

© C. D. Cantrell (02/1999)

CONTROL HAZARDS

• A control hazard occurs because a branch or jump instruction needs to make a decision based on the results of operations or instructions that are still pending
   A beq instruction cannot update the PC before the test for equality has completed
    ◦ We’ll see later that it’s possible to make the decision at the Instruction Decode/Register Fetch stage instead of the ALU stage
   In the MIPS ISA, only the instruction immediately following the branch is executed or not executed, depending on the branch decision
    ◦ This isn’t true in longer pipelines
   Methods for dealing with control hazards:
    ◦ Stall
    ◦ Predict the branch
    ◦ Always execute the instruction following the branch

© C. D. Cantrell (05/1999)

STALLING AS A SOLUTION FOR CONTROL HAZARDS

• After a conditional branch (beq), there is a one-stage stall (bubble)

(Figure: after beq, the fetch of the next instruction is delayed one clock period, inserting a 2 ns bubble)

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

PREDICTING BRANCHES NOT TAKEN AS A SOLUTION

(Figure: with the prediction “branch not taken,” the instruction after beq proceeds with no delay when the prediction is correct; when the branch is taken, that instruction is discarded and replaced by a bubble, costing one clock period)

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

PIPELINE DELAYED BRANCH AS A SOLUTION

• After the conditional branch (beq), there is an add instruction (which can do useful work) instead of a stall (bubble), which does nothing

(Figure: the add is moved into the delayed branch slot after beq and always executes, so no clock period is wasted whether or not the branch is taken)

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

FORWARDING AS A SOLUTION FOR DATA HAZARDS (1)

(Figure: add $s0, $t0, $t1 followed by sub $t2, $s0, $t3: the ALU result of the add is forwarded directly to the ALU input of the sub, so no stall is needed)

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

FORWARDING AS A SOLUTION FOR DATA HAZARDS (2)

(Figure: lw $s0, 20($t1) followed by sub $t2, $s0, $t3: even with forwarding, the loaded value is not available until after MEM, so one bubble must be inserted before the sub)

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

SINGLE-CYCLE DATAPATH

(Figure: the single-cycle datapath divided into five sections: IF, instruction fetch; ID, instruction decode/register file read; EX, execute/address calculation; MEM, memory access; WB, write back)

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

PIPELINED EXECUTION IN SINGLE-CYCLE DATAPATH

(Figure: three lw instructions overlapped over clock cycles CC 1–CC 7, each occupying IM, Reg, ALU, DM, and Reg in successive cycles)

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

SINGLE-CYCLE DATAPATH WITH PIPELINE REGISTERS

(Figure: the single-cycle datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between stages; each register has an input side and an output side)

• Because the state of a D flip-flop changes only on clock edges, new data can be asserted on the inputs of the pipeline registers while the data written in the previous clock period is still valid on the outputs

(Figure sequence: the pipelined datapath highlighted stage by stage as lw passes through instruction fetch, instruction decode, execution, memory, and write back, and as sw passes through execution, memory, and write back)

DATAPATH MODIFICATIONS FOR PIPELINING

• The number of the register that an instruction must write to is read in the ID stage
• Consider two instructions:
      lw  $10, 20($1)
      sub $11, $2, $3
   WriteRegister signal values are 10 (for lw) and 11 (for sub)
   The WriteRegister signal is read only in the WB stage
   The sub’s ID stage modifies WriteRegister before lw can read it
   Therefore the value of the WriteRegister signal is part of the instruction’s state, and must be passed along in pipeline registers as the instruction executes
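The last bullet (WriteRegister must travel with the instruction) can be sketched as a shift register of per-stage state. The struct and field names below are illustrative, not the textbook's signal names:

```c
/* Only the write-register number is modeled here */
typedef struct { int write_reg; } PipeReg;

typedef struct { PipeReg id_ex, ex_mem, mem_wb; } Pipeline;

/* One clock edge: each instruction's WriteRegister value advances
   through ID/EX, EX/MEM, MEM/WB along with the instruction.
   Returns the number used by the WB stage in this clock period. */
int clock_edge(Pipeline *p, int decoded_write_reg) {
    int retiring = p->mem_wb.write_reg;
    p->mem_wb = p->ex_mem;
    p->ex_mem = p->id_ex;
    p->id_ex.write_reg = decoded_write_reg;
    return retiring;
}
```

Decoding lw’s destination 10 followed by sub’s 11 delivers 10 to WB three clock edges later, untouched by the sub’s ID stage.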

© C. D. Cantrell (05/1999)

(Figure sequence: the datapath modified to carry the write-register number through the pipeline registers to handle a load; the pipeline stages used by a load instruction; two representations, one as overlapping datapath stages and one as labeled boxes, of the pipelined execution of lw $10, 20($1) followed by sub $11, $2, $3; and clock-by-clock snapshots of that pair in clock periods 1–6)

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

PIPELINED DATAPATH WITH CONTROL SIGNALS

(Figure: the pipelined datapath with the control signals PCSrc, RegWrite, Branch, MemWrite, MemRead, MemtoReg, ALUSrc, ALUOp, and RegDst added)

SETTINGS OF CONTROL LINES

Inst.   |  Ex/Address Calc.              |  Mem. Access               |  Write Back
type    |  RegDst ALUOp1 ALUOp0 ALUSrc   |  Branch MemRead MemWrite   |  RegWrite Mem-to-reg
--------+--------------------------------+----------------------------+---------------------
R-type  |    1      1      0      0      |    0       0       0       |     1        0
lw      |    0      0      0      1      |    0       1       0       |     1        1
sw      |    d      0      0      1      |    0       0       1       |     0        d
beq     |    d      0      1      0      |    1       0       0       |     0        d

(d = don’t care)
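These control settings can be encoded as a lookup keyed by instruction type, with −1 standing for a don't-care (d). A minimal C sketch; the struct and field names are illustrative:

```c
#include <string.h>

typedef struct {
    const char *inst;
    int regdst, aluop1, aluop0, alusrc;   /* Ex/Address Calc. */
    int branch, memread, memwrite;        /* Mem. Access */
    int regwrite, memtoreg;               /* Write Back */
} Ctl;

static const Ctl ctl_table[] = {
    {"R-type", 1, 1, 0, 0,  0, 0, 0,  1,  0},
    {"lw",     0, 0, 0, 1,  0, 1, 0,  1,  1},
    {"sw",    -1, 0, 0, 1,  0, 0, 1,  0, -1},
    {"beq",   -1, 0, 1, 0,  1, 0, 0,  0, -1},
};

/* Returns the control settings for an instruction type, or 0 if unknown */
const Ctl *ctl_lookup(const char *inst) {
    for (unsigned i = 0; i < sizeof ctl_table / sizeof ctl_table[0]; i++)
        if (strcmp(ctl_table[i].inst, inst) == 0)
            return &ctl_table[i];
    return 0;
}
```

This is exactly what the control unit does in hardware: decode the opcode once in ID and hand the resulting bundle of bits down the pipeline.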

© C. D. Cantrell (05/1999)

PIPELINED CONTROL

• The control signals for an instruction are determined in the ID stage
   The next instruction’s ID stage asserts new values of the control signals
   Control signals are part of the state of the instruction, and therefore must be passed along from stage to stage in pipeline registers, just like data

© C. D. Cantrell (05/1999)

CONTROL LINES FOR THE THREE FINAL STAGES

(Figure: the control bundle produced in ID is split into EX, M, and WB fields and shifted through the ID/EX, EX/MEM, and MEM/WB pipeline registers)

PIPELINED DATAPATH WITH CONTROL LOGIC AND SIGNALS

(Figure: the complete pipelined datapath with the control unit and all control signals in place)

PIPELINED EXECUTION OF FIVE INSTRUCTIONS

• We’ll follow what happens in the instruction sequence
      [40000024] lw  $10, 20($1)
      [40000028] sub $11, $2, $3
      [4000002c] and $12, $4, $5
      [40000030] or  $13, $6, $7
      [40000034] add $14, $8, $9
   For each clock period, note the values of the RegWrite signal in the ID/EX, EX/MEM, and MEM/WB pipeline registers

© C. D. Cantrell (05/1999)

(Figure sequence: clock-by-clock snapshots, clock periods 1–9, of the five instructions moving through the pipelined datapath with control; each snapshot shows the control fields entering ID/EX, EX/MEM, and MEM/WB, the registers being read, and the write-register number carried along with each instruction)

MIPS PIPELINES (3)

• MIPS R2000 integer unit pipeline hazard conditions, named for the pipeline registers they involve; ReadRegister_j denotes the register named in read-register field j of the instruction in IF/ID:

  1. ID/EX.WriteRegister  = IF/ID.ReadRegister_j (j = 1, 2)
  2. EX/MEM.WriteRegister = IF/ID.ReadRegister_j (j = 1, 2)
  3. MEM/WB.WriteRegister = IF/ID.ReadRegister_j (j = 1, 2)
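The three comparisons can be written out directly. A minimal C sketch (the bitmask encoding is illustrative; register $0 is excluded because writes to it are ignored):

```c
/* Bitmask of the hazard conditions above for one IF/ID read register:
   bit 0 = condition 1 (ID/EX), bit 1 = condition 2 (EX/MEM),
   bit 2 = condition 3 (MEM/WB) */
int raw_hazards(int idex_wr, int exmem_wr, int memwb_wr, int ifid_read) {
    int m = 0;
    if (ifid_read != 0) {          /* $0 never causes a hazard */
        if (idex_wr == ifid_read)  m |= 1;
        if (exmem_wr == ifid_read) m |= 2;
        if (memwb_wr == ifid_read) m |= 4;
    }
    return m;
}
```

For the sub/and pair on the next slide, the and in ID sees sub’s destination $2 in ID/EX, so the function reports condition 1.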

© C. D. Cantrell (02/1999)

MIPS PIPELINES (4)

• MIPS R2000 integer pipeline hazards generated by the instructions
      sub $2,$1,$3
      and $12,$2,$5
      or  $13,$6,$2
      add $14,$2,$2
      sw  $15,100($2)
   The instruction and $12,$2,$5 results in hazard 1a,
      ID/EX.WriteRegister = IF/ID.ReadRegister1 = $2
   The instruction or $13,$6,$2 results in hazard 2b,
      EX/MEM.WriteRegister = IF/ID.ReadRegister2 = $2
   The instruction add $14,$2,$2 results in hazards 3a and 3b,
      MEM/WB.WriteRegister = IF/ID.ReadRegister1 = $2
      MEM/WB.WriteRegister = IF/ID.ReadRegister2 = $2

© C. D. Cantrell (05/1999)

(Figure sequence: pipelined execution of the five-instruction sequence over CC 1–CC 9: register $2 is written with −20 in CC 5, so and and or would read the stale value 10; a second view shows −20 already sitting in EX/MEM in CC 4 and in MEM/WB in CC 5, where forwarding can pick it up; the datapath without forwarding; the datapath with a forwarding unit driving the ForwardA and ForwardB multiplexors at the ALU inputs from EX/MEM.RegisterRd and MEM/WB.RegisterRd compared against Rs and Rt; and the datapath modified to resolve hazards by forwarding)

PIPELINED EXECUTION WITH FORWARDING

• We’ll follow what happens in the instruction sequence
      [40000028] sub $2, $1, $3
      [4000002c] and $4, $2, $5
      [40000030] or  $4, $4, $2
      [40000034] add $9, $4, $2
   Without forwarding, there would be RAW hazards on register $2 in the and instruction and on register $4 in the or and add instructions
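The forwarding unit's choice for one ALU input can be sketched as follows (Patterson-Hennessy-style conditions; the encoding 2 = take EX/MEM, 1 = take MEM/WB, 0 = take the register file, and the parameter names, are illustrative):

```c
/* Select the source for one ALU input.  The EX/MEM result is checked
   first because it is the most recent value of the register; $0 is
   excluded because it is never really written. */
int forward_select(int exmem_regwrite, int exmem_rd,
                   int memwb_regwrite, int memwb_rd, int idex_rs) {
    if (exmem_regwrite && exmem_rd != 0 && exmem_rd == idex_rs)
        return 2;   /* forward from EX/MEM */
    if (memwb_regwrite && memwb_rd != 0 && memwb_rd == idex_rs)
        return 1;   /* forward from MEM/WB */
    return 0;       /* value from the register file is fine */
}
```

In the sequence above, when the and is in EX, sub’s $2 sits in EX/MEM, so the unit selects 2; one clock period later the same $2 is supplied from MEM/WB instead.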

© C. D. Cantrell (05/1999)

(Figure sequence: clock-by-clock snapshots, clocks 3–6, of the sequence executing with forwarding: in clock 3, ID/EX.WriteRegister = IF/ID.ReadRegister1 = 2 flags the RAW hazard on the and; in clock 4, EX/MEM.WriteRegister = ID/EX.ReadRegister1 = 2 is the test by which the need for forwarding to ALU input 1 is actually detected; in clocks 5 and 6, EX/MEM and MEM/WB supply the values of $4 and $2 to the ALU inputs of or and add)

ADDITION OF A MULTIPLEXOR TO CHOOSE THE IMMEDIATE VALUE

(Figure: an ALUSrc multiplexor placed after the forwarding multiplexors, so that either a forwarded register value or the sign-extended immediate can feed the ALU)

A DATA HAZARD THAT CANNOT BE RESOLVED BY FORWARDING

Program execution order (in instructions), versus time in clock cycles CC 1–CC 9:

lw  $2, 20($1)   IM  Reg  DM  Reg
and $4, $2, $5       IM  Reg  DM  Reg
or  $8, $2, $6           IM  Reg  DM  Reg
add $9, $4, $2               IM  Reg  DM  Reg
slt $1, $6, $7                   IM  Reg  DM  Reg

(The lw does not produce $2 until the end of CC 4, but the and needs it at the start of CC 4 — a hazard that forwarding alone cannot resolve.)

HOW STALLS ARE INSERTED INTO A PIPELINE

Program execution order (in instructions), versus time in clock cycles CC 1–CC 10:

lw  $2, 20($1)   IM  Reg  DM  Reg
and $4, $2, $5       IM  Reg  Reg  DM  Reg
or  $8, $2, $6           IM  IM   Reg  DM  Reg
add $9, $4, $2     (bubble)      IM   Reg  DM  Reg
slt $1, $6, $7                        IM   Reg  DM  Reg

(A one-clock bubble after the lw delays every following instruction by one cycle; forwarding then resolves the remaining hazards.)

OVERVIEW OF PIPELINED CONTROL

[Figure: pipelined datapath with both the forwarding unit and a hazard detection unit. The hazard detection unit examines ID/EX.MemRead together with the register numbers IF/ID.RegisterRs and IF/ID.RegisterRt; on a load-use hazard it deasserts PCWrite and IF/IDWrite and forces the ID/EX control fields to 0 (a bubble). The forwarding unit compares EX/MEM.RegisterRd and MEM/WB.RegisterRd against ID/EX.RegisterRs and ID/EX.RegisterRt.]
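The comparisons performed by the forwarding and hazard detection units can be expressed in a few lines. The following is a minimal Python sketch (function and field names are ours, not the textbook's) of the two tests:

```python
# Minimal sketch of the MIPS forwarding / hazard-detection comparisons.
# Only register-number fields are modeled; datapath values are omitted.

def forward_a(idex_rs, exmem_rd, exmem_regwrite, memwb_rd, memwb_regwrite):
    """Select the source for ALU input 1 (the ForwardA mux control)."""
    if exmem_regwrite and exmem_rd != 0 and exmem_rd == idex_rs:
        return "EX/MEM"          # forward the previous instruction's ALU result
    if memwb_regwrite and memwb_rd != 0 and memwb_rd == idex_rs:
        return "MEM/WB"          # forward the value being written back
    return "ID/EX"               # no hazard: use the register-file value

def load_use_stall(idex_memread, idex_rt, ifid_rs, ifid_rt):
    """Hazard detection unit: stall when a load is followed immediately
    by an instruction that reads the loaded register."""
    return idex_memread and idex_rt in (ifid_rs, ifid_rt)

# Clock 4 of the example: sub $2, $1, $3 has reached MEM (its result sits
# in EX/MEM) while and $4, $2, $5 is in EX and needs $2 at ALU input 1.
print(forward_a(idex_rs=2, exmem_rd=2, exmem_regwrite=True,
                memwb_rd=0, memwb_regwrite=False))   # EX/MEM

# lw $2, 20($1) in EX while and $4, $2, $5 is in ID: must stall.
print(load_use_stall(idex_memread=True, idex_rt=2, ifid_rs=2, ifid_rt=5))
```

The `$0`-exclusion tests mirror the hardware: register $0 is hardwired to zero, so a "result" destined for it must never be forwarded.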

PIPELINED EXECUTION WITH A STALL

• We'll follow what happens in the instruction sequence
  [40000028] lw  $2, 20($1)
  [4000002c] and $4, $2, $5
  [40000030] or  $4, $4, $2
  [40000034] add $9, $4, $2
. Note that the hardware inserts a stall after the lw instruction
. Forwarding resolves the RAW hazards on register $2 in the and instruction and on register $4 in the or and add instructions

[Figures: pipelined datapath with the hazard detection and forwarding units at clock cycles 2–7. At clock 3 the hazard detection unit finds ID/EX.MemRead asserted with ID/EX.RegisterRt = 2 matching a source register of the and instruction, so at clock 4 a bubble is inserted: PCWrite and IF/IDWrite are deasserted and the control fields entering ID/EX are zeroed. From clock 5 on, forwarding supplies $2 to the and instruction and $4 to the or and add instructions.]

EFFECT OF A PIPELINE ON A BRANCH INSTRUCTION

Program execution order (in instructions), versus time in clock cycles CC 1–CC 9:

40 beq $1, $3, 7     IM  Reg  DM  Reg
44 and $12, $2, $5       IM  Reg  DM  Reg
48 or  $13, $6, $2           IM  Reg  DM  Reg
52 add $14, $2, $2               IM  Reg  DM  Reg
72 lw  $4, 50($7)                    IM  Reg  DM  Reg

(The branch outcome is not known until the MEM stage, so three instructions after the beq are fetched before the branch to address 72 takes effect.)

PIPELINED DATAPATH INCLUDING SUPPORT FOR BRANCHES

[Figure: pipelined datapath with branch support. The branch comparison (=) and the target computation (shift left 2, added to PC + 4) are moved into the ID stage, and an IF.Flush signal turns the instruction in IF into a bubble when the branch is taken.]

A PIPELINED BRANCH

• We'll follow what happens in the instruction sequence
  [40000024] sub $10, $4, $8
  [40000028] beq $1, $3, 7     # PC-relative branch to offset
  [4000002c] and $12, $2, $5   # 40 + 4 + 7*4 = 72 = 0x48
  [40000030] or  $14, $2, $6
  [40000034] add $14, $4, $2
  [4000003c] slt $15, $6, $7
  ...
  [40000048] lw  $14, 50($7)

[Figures: the branch datapath at clock cycles 3 and 4. At clock 3 the beq comparison of $1 and $3 succeeds in ID, the target address 72 (0x48) is selected as the next PC, and IF.Flush turns the and instruction just fetched into a bubble (nop). At clock 4 the lw at address 72 is fetched while the bubble proceeds down the pipeline.]

EFFECT OF AN OPTIMIZED PIPELINE ON A BRANCH INSTRUCTION

Program execution order (in instructions), versus time in clock cycles CC 1–CC 9:

40 beq $1, $3, 7     IM  Reg  DM  Reg
44 and $12, $2, $5       IM  Reg  DM  Reg
72 lw  $4, 50($7)            IM  Reg  DM  Reg

(With the branch decision moved into the ID stage, only one instruction after the beq is fetched before the branch takes effect.)

BRANCH PREDICTION SCHEMES

• 0-bit branch prediction: Assume that the branch is not taken
• A common branch prediction scheme uses a branch history table
. Each entry in the table is indexed by the lower 16 bits of the address of the branch instruction
. Each entry consists of a bit that is set if the branch was recently taken
. If the branch is not taken, the bit is toggled
. Performance shortcoming: If a branch is almost always taken (or not taken), then the bit gets toggled on a wrong prediction, and the next branch is likely to be mispredicted
• A 2-bit branch prediction scheme uses a branch history table in which each entry contains 2 bits to indicate the state of a branch prediction FSM (next slide)
. This scheme mispredicts only once if a branch almost always goes one way

[Figure: a branch-target buffer. The PC of the instruction to fetch is looked up in the buffer; on a match the instruction is predicted to be a taken branch and the predicted PC is used as the next PC, otherwise the instruction is not predicted to be a branch and fetch proceeds normally.]

STATES IN A 2-BIT BRANCH PREDICTION SCHEME

[Figure: the 2-bit prediction FSM. Two "predict taken" states and two "predict not taken" states; a taken branch moves the FSM toward "predict taken," a not-taken branch toward "predict not taken," so the prediction flips only after two consecutive mispredictions.]

SCHEDULING THE BRANCH DELAY SLOT

[Figure: three ways to fill a branch delay slot — (a) from before the branch (an independent instruction such as add $s1, $s2, $s3 moves into the slot), (b) from the branch target, and (c) from the fall-through path. In (b) and (c) the moved instruction (e.g. sub $t4, $t5, $t6) must be safe to execute even when the branch goes the other way.]
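As an illustration, here is a minimal Python sketch of the 2-bit saturating-counter predictor described above (the state encoding 0–3 is ours):

```python
# 2-bit saturating-counter branch predictor.
# States 0,1 predict "not taken"; states 2,3 predict "taken".

class TwoBitPredictor:
    def __init__(self, state=0):
        self.state = state          # 0 = strongly not taken, 3 = strongly taken

    def predict(self):
        return self.state >= 2      # True means "predict taken"

    def update(self, taken):
        # Saturate at 0 and 3: two consecutive mispredictions are
        # needed before the prediction flips.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor(state=3)        # branch has almost always been taken
outcomes = [True, True, False, True, True]   # one anomalous not-taken
mispredicts = 0
for taken in outcomes:
    if p.predict() != taken:
        mispredicts += 1
    p.update(taken)
print(mispredicts)   # 1 -- the single anomaly causes exactly one misprediction
```

A 1-bit predictor on the same outcome stream would mispredict twice: once on the anomaly and once more on the next taken branch.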

EXCEPTIONS IN A PIPELINED PROCESSOR

• Five instructions are active in any given clock period
. Multiple exceptions can occur simultaneously
. If execution is not stopped soon enough, the value in the register that helped cause the exception may be overwritten in the WB stage
. To flush the instructions that follow the instruction that caused the exception, we add two new signals, ID.Flush and EX.Flush
. ID.Flush is ORed with the stall signal from the hazard detection unit to flush an instruction during its ID stage
. To flush an instruction in its EX stage, we add an input to the PC multiplexor that sends 0x80000080 to the PC

DATAPATH WITH CONTROLS TO HANDLE EXCEPTIONS

[Figure: the branch datapath extended with IF.Flush, ID.Flush, and EX.Flush signals, a Cause register, an EPC (Except PC) register, and an extra PC-multiplexor input that forces the PC to 0x80000080 when an exception is recognized.]

A PIPELINED EXCEPTION

• We'll follow what happens in the instruction sequence
  [40000040] sub $11, $2, $4
  [40000044] and $12, $2, $5
  [40000048] or  $13, $2, $6
  [4000004c] add $1, $2, $1    # overflow exception occurs here
  [40000050] slt $15, $6, $7
  [40000054] lw  $16, 50($7)
  ...
  given that the instructions to execute when an exception occurs are
  [80000080] lui $1, -28672
  [80000084] sw  $2, 592($1)   # sw $v0 s1
  ...

[Figures: the exception datapath at clock cycles 5 and 6. At clock 5 the overflow in the add is detected in EX; EX.Flush, ID.Flush, and IF.Flush turn the add, slt, and lw into bubbles, the exception address 80000080 is selected as the next PC, and the exception PC is saved in the EPC register while the Cause register is written. At clock 6 the first handler instruction, lui $1, -28672, is fetched from 80000080.]

SUPERSCALAR DATAPATH

[Figure: a static two-issue MIPS datapath that fetches and decodes two instructions per clock — one ALU/branch operation and one load/store — with an extra ALU for address calculation, extra register-file ports, and two sign-extend units.]

[Figure: a dynamically scheduled pipeline. An in-order issue and decode unit dispatches instructions to reservation stations in front of the functional units (integer units, floating point, load/store); the functional units execute out of order, and a commit unit retires results in order. A second, more detailed diagram adds an instruction queue with branch prediction, a register file, a decode/dispatch unit, separate load and store units, and a reorder buffer feeding the commit unit.]

PIPELINE DEPTH vs. SPEEDUP

[Figure: plot of relative performance (0.0–3.0) versus pipeline depth (1, 2, 4, 8, 16)]

PIPELINED DATAPATH WITH CONTROL

[Figure: the complete pipelined datapath with control, combining the branch, flush, and exception hardware: RegWrite, ALUSrc, MemWrite, MemtoReg, MemRead, RegDst, and ALUOp control signals, the ALU control block, the sign-extend unit (16 → 32 bits), and instruction fields [25–21], [20–16], [15–11] feeding the register-destination multiplexor and the forwarding unit.]

Comparison of datapath designs:
. Clock rate: Slower for the single-cycle datapath (section 5.3); Faster for the multicycle (section 5.4) and pipelined (Chapter 6) datapaths
. Instruction throughput (instructions per clock cycle, or 1/CPI): Slower for the multicycle datapath; Faster for the pipelined datapath
. Hardware: Specialized in the single-cycle datapath; Shared in the multicycle datapath
. Clock cycles of latency for an instruction: 1 for the single-cycle datapath; Several for the multicycle and pipelined datapaths

TYPES OF MEMORY

• Non-random access
. Latency dependent on addresses of last block and current block
. Examples:
  Magnetic disk (latency = seek time + rotation time)
  Magnetic tape (completely sequential access)
  CD-ROM
. Several contiguous blocks take almost the same time as one block
• Random access memory (RAM)
. Latency independent of address
. Example: Semiconductor memory
  Static RAM (SRAM) — Retains data as long as power is on
  Dynamic RAM (DRAM) — Data must be refreshed periodically


DISK MEMORY (1)

• A hard disk records bits as patterns of magnetization in a thin coating on the surface of a rigid platter
. Information is read or written through a set of heads attached to the ends of an arm with multiple segments

[Figure: disk geometry — disks are organized into platters, cylinders, tracks, and sectors]

DISK MEMORY (2)

• The latency of a magnetic disk access is given by the formula
  latency = seek time + rotational delay + transfer time + controller delay
  (not counting delays due to the fact that other processes may be using the disk)
• Terminology:
. Seek time = time for the arm to move from its present track to the track where the requested data is located
. Rotational time = time for the requested sector to rotate to the position of the arm
    rotational time ≤ 60/(disk rpm) seconds; average = 30/(disk rpm) seconds
. Transfer time = time for data transfer from disk to main memory
    transfer time ≥ (amount of data in MB)/(transfer rate in MB/s)

DISK MEMORY (3)

• Example: latency of writing one 512-byte sector on a magnetic disk rotating at 5400 rpm, with the following parameters:
. Average seek time = 12 ms
. Transfer rate = 5 MB/s
. Controller delay = 2 ms
    average latency = 12 ms + (30/5400) s + (0.5 × 10⁻³ MB)/(5 MB/s) + 2 ms
                    = 19.7 ms

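The latency formula above is easy to check numerically; a small Python sketch using the example's parameters:

```python
def disk_latency_ms(seek_ms, rpm, data_mb, rate_mb_s, controller_ms):
    """Average disk access latency: seek + average rotational delay
    (half a revolution = 30/rpm seconds) + transfer time + controller delay."""
    rotational_ms = (30.0 / rpm) * 1000.0
    transfer_ms = (data_mb / rate_mb_s) * 1000.0
    return seek_ms + rotational_ms + transfer_ms + controller_ms

# One 512-byte sector at 5400 rpm, 12 ms seek, 5 MB/s transfer, 2 ms controller
latency = disk_latency_ms(seek_ms=12, rpm=5400, data_mb=0.5e-3,
                          rate_mb_s=5, controller_ms=2)
print(round(latency, 1))   # 19.7
```

Note that the 5.6 ms rotational delay dominates everything except the seek itself; the actual 512-byte transfer contributes only 0.1 ms.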

RANDOM-ACCESS MEMORY (1)

• Read-only
. Mask-programmed ROM
. User-programmable (PROM)
• Read mostly
. EPROM, EEPROM
• Read/write
. Static RAM (SRAM): Semiconductor logic
  Fast; no need for refresh cycle
  Costly; hot
. Dynamic RAM (DRAM): Arrays of CMOS capacitors
  Low cost & heat dissipation; a commodity product
  High capacity
  Slow because of complex addressing and refresh cycle

RANDOM-ACCESS MEMORY (2)

• SRAM
. Access time: 5–15 ns
. Price (8/96): $80–200/MB
. Hierarchical design of SRAM memory:
  Cell design
  IC design
  SRAM memory system design
• SRAM cell design:
. Clocked D-latch (single or master-slave configuration)
. Tristate output
  Enable asserted: Normal latch operation
  Enable deasserted: Latch acts as an unconnected wire (high-impedance state)


CLOCK SIGNALS; SETUP AND HOLD TIMES

• Clock signals often have 50% duty cycle (time high = time low)
• Clock rise and fall times
. Not necessarily equal
. Typically a small fraction of the clock period
◦ State transitions occur on the leading or trailing edge, during the rise or fall time interval
◦ Must be short in order to minimize setup and hold times
• Data and control signals for sequential logic units must be stable for certain time intervals before and after the clock transition
. Setup time = time over which a signal must be stable before the clock transition
. Hold time = time over which a signal must be stable after the clock transition

[Figure: illustration of setup and hold times — the address (ADDR) is latched on the clock edge; Write Enable (WE) and Data Input (DI) must be stable for the setup time before, and the hold time after, the edge at which data is latched]

RANDOM-ACCESS MEMORY (3)

• Memory cells are accessed as an array
. One-dimensional array (example: register file)

[Figure: a register file. Read access: two 5-bit read-register numbers select Read data 1 and Read data 2 through 32-to-1 multiplexors over registers 0 to 2ⁿ−1. Write access: an n-to-2ⁿ decoder of the write-register number, gated by Write enable, asserts the C (clock/enable) input of the selected register, whose D input carries the register data.]

◦ Read access to a specific register is afforded through a multiplexor
  Select signal is the register address (register number)
◦ Write access to a specific register is obtained by decoding the register address, which asserts that register's write enable signal

RANDOM-ACCESS MEMORY (4)

• One-dimensional array of multi-bit memory cells
. Each cell contains a fixed number of bits
. A decoder selects the row that is addressed
. When a word line is enabled, data appears on all of the bit lines
. The desired data is selected by a multiplexor

[Figure: word lines (rows) crossing bit lines (columns); each dot represents a bit. Columns n−3 to 0 omitted for clarity.]

4 × n SRAM (read cycle) — Reading from SRAM

[Figure: SRAM read-cycle timing. Addresses A0, A1, A2 are latched on successive clock edges; Chip Select (CS) is asserted, Write Enable (WE) stays deasserted, and Output Enable (OE) gates the outputs. After the access time, Data Output (DQ) carries DQ0, DQ1, DQ2 — the data associated with addresses A0, A1, A2 respectively. Setup and hold times apply to the address and control inputs.]

4 × n SRAM (write cycle) — Writing to SRAM

[Figure: SRAM write-cycle timing. Addresses A0, A1, A2 are latched on successive clock edges with Chip Select (CS) asserted; Data Inputs DI0, DI1, DI2 must meet the setup and hold times around the edge at which Write Enable (WE) latches them. DI0 is written to the memory location with address A0. Columns n−3 to 0 omitted for clarity.]

RANDOM-ACCESS MEMORY (5)

• SRAM memory system design

  No. of rows of ICs = (No. of words in memory system) / (No. of words per IC)
  No. of columns of ICs = (No. of bits per word in memory system) / (No. of bits per word per IC)

. Addressing:
◦ No. of rows of ICs determines the no. of address bits into the decoder
• Example: A 4M × 32 bit (4 194 304 × 32 bit) SRAM memory system built from 2M × 8 bit SRAM ICs
    No. of rows of ICs = 4M/2M = 2 ⇒ 1-to-2 decoder
    No. of columns of ICs = 32/8 = 4

[Figure: an SRAM system built from eight 512 × 64 SRAM arrays. Address bits [14–6] drive a 9-to-512 decoder that selects one 64-bit row in every array; address bits [5–0] drive the output multiplexors that select one bit per array, producing Dout7–Dout0.]
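The sizing arithmetic from the SRAM memory-system example above can be sketched directly:

```python
def sram_array_shape(sys_words, sys_bits, ic_words, ic_bits):
    """Number of IC rows and columns needed to build a memory system of
    sys_words x sys_bits out of ic_words x ic_bits SRAM ICs."""
    rows = sys_words // ic_words        # determines the decoder size
    cols = sys_bits // ic_bits
    return rows, cols

# 4M x 32-bit system from 2M x 8-bit ICs
rows, cols = sram_array_shape(4 * 2**20, 32, 2 * 2**20, 8)
print(rows, cols)   # 2 4 -> a 1-to-2 decoder selects the row of ICs
```

The row count fixes the decoder: 2 rows need a 1-bit (1-to-2) decoder, in general log2(rows) address bits.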

RANDOM-ACCESS MEMORY (6)

• Two-dimensional array of memory cells
. Implement as a 1-dimensional array of multi-cell words
. Each cell contains a fixed number of bits
. Separate decoders select the row and column of the cell that is addressed
. When a word line is enabled, data appears on all of the bit lines

[Figure: word lines crossing bit lines; each dot represents a memory cell]


RANDOM-ACCESS MEMORY (7)

• On a memory IC that stores megabits of data, there are thousands of row and column select lines
. Too many select lines to pin out
. There are too many outputs for a multiplexor of practical size and delay
. Solution: Use separate decoders for the rows and columns
. Assume a chip with 2^M rows and 2^N columns
◦ The M least significant bits of the main memory address determine the row
◦ The next N bits determine the column
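The row/column split just described is a simple bit-field extraction; a sketch (the 12/10 bit widths match the 4K-refresh 4M × 4 DRAM discussed later):

```python
def split_dram_address(addr, m_bits, n_bits):
    """Split an address for a chip with 2**m_bits rows and 2**n_bits
    columns: the m_bits least significant bits select the row, the
    next n_bits select the column."""
    row = addr & ((1 << m_bits) - 1)
    col = (addr >> m_bits) & ((1 << n_bits) - 1)
    return row, col

# 12 row-address bits, 10 column-address bits: row 3, column 1
print(split_dram_address((1 << 12) | 3, 12, 10))   # (3, 1)
```

On a real DRAM these two fields share the same pins and are presented sequentially, strobed by RAS and CAS.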

[Figure: dollars per DRAM chip versus year, 1978–1995, for generations from 16 Kb through 16 Mb. Each generation enters near the top of the price range and falls toward a low final chip cost as it matures. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2nd Edition.]

RANDOM-ACCESS MEMORY (8)

• DRAM memory cell design
. MOS technology
. At the intersection of each bit line and each word line, a capacitor Ccell is the memory unit; a transistor functions as a switch
◦ Word line asserted: Switch closed ⇒ Ccell connected to bit line
◦ Write operation: Value to be written (0 or 1) asserted on bit line
  Value = 1 (0) ⇒ Ccell charged (discharged)
◦ Read operation: Bit line precharged to halfway voltage, Vpre
  Word line activated ⇒ Ccell discharges onto (charges from) the bit line
  Voltage change
    ∆V = (Vbit − Vpre) · Ccell/(Ccell + Cline) ≈ 250 mV
  detected by the sense amplifier for the bit line

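The sense-amplifier swing follows from charge sharing between Ccell and the bit-line capacitance; a quick numeric sketch (the capacitance and voltage values are illustrative assumptions, not from the slide):

```python
def sense_voltage_mv(v_bit, v_pre, c_cell, c_line):
    """Bit-line voltage change after charge sharing:
    dV = (Vbit - Vpre) * Ccell / (Ccell + Cline), returned in mV."""
    return (v_bit - v_pre) * c_cell / (c_cell + c_line) * 1000.0

# Illustrative: 30 fF cell, 270 fF bit line, 5 V stored vs 2.5 V precharge
print(round(sense_voltage_mv(5.0, 2.5, 30e-15, 270e-15)))   # 250
```

The cell capacitance is an order of magnitude smaller than the stray bit-line capacitance, which is why the detectable swing is only a few hundred millivolts.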

1-TRANSISTOR DRAM CELL

[Figure: 1-transistor DRAM cell — the word line drives the gate of a pass transistor connecting the cell capacitance Ccell to the bit line, which has stray capacitance Cline]


RANDOM-ACCESS MEMORY (9)

• DRAM memory cell physics
. Charge stored on a 2-conductor capacitor with potential difference V between conductors: Q = CV
. Capacitance of two parallel plates with area A, separation d, and permittivity ε:
    C = εA/d
. Conclusions:
◦ To make Q as large as possible for a given V, one must increase ε and A while decreasing d
  ε = ε₀εᵣ, where εᵣ is a material property (hard to change!)
  The area A can be increased without increasing the footprint of a memory cell by using trench or stack geometries

[Figure: STACK AND TRENCH DRAM CELL GEOMETRIES]
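A quick sketch of the parallel-plate estimate (the plate dimensions and oxide choice are illustrative assumptions):

```python
EPS0 = 8.854e-12   # vacuum permittivity, F/m

def plate_capacitance(eps_r, area_m2, d_m):
    """Parallel-plate capacitance C = eps0 * eps_r * A / d."""
    return EPS0 * eps_r * area_m2 / d_m

# Illustrative: 1 um x 1 um plate, 10 nm dielectric, eps_r ~ 3.9 (SiO2)
c = plate_capacitance(3.9, 1e-6 * 1e-6, 10e-9)
print(f"{c * 1e15:.1f} fF")   # 3.5 fF
```

A planar cell of this size stores only a few femtofarads, which is exactly why trench and stack geometries are used to grow A without growing the footprint.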

RANDOM-ACCESS MEMORY (10)

• DRAM IC design
. Hierarchical organization (an array of arrays of words, etc.)
. Tristate outputs permit construction of multi-IC memory systems
. Data in/data out pins are shared in order to reduce pin count
◦ Access mode (read or write) is selected by Output Enable and Write Enable signals
. Row and column address pins are also shared
◦ First the Row Address Strobe (RAS) samples the row address
◦ The row address is latched
◦ Then the Column Address Strobe (CAS) samples the column address
. Shared I/O and address lines, plus refresh cycle ⇒ DRAM initial latency ≈ 5–10 × SRAM latency

c C. D. Cantrell (02/1999) FUNCTIONAL BLOCK DIAGRAM - 4K REFRESH

WE# DATA-IN 4 CAS# DQ1 BUFFER DQ2 DQ3 DQ4 NO. 2 CLOCK DATA-OUT 4 BUFFER GENERATOR 4

OE#

COLUMN COLUMN ADDRESS 10 A0 10 DECODER BUFFER(10) A1 1024 4 A2 REFRESH A3 CONTROLLER SENSE AMPLIFIERS A4 I/O GATING A5 A6 1024 REFRESH A7 COUNTER A8 A9 12 4096 x 1024 x 4 4096 A10 ROW MEMORY A11 ARRAY 12 ADDRESS 12 4096 BUFFERS (12) ROW 4096 (1 of 4096) DECODER SELECT ROW SELECT COMPLEMENT

NO. 1 CLOCK V DD RAS# GENERATOR Vss DRAM Read Cycle

[Figure: DRAM read-cycle timing. RAS# falls to latch the row address (setup tASR, hold tRAH), then CAS# falls to latch the column address (tASC, tCAH); with WE# high and OE# asserted, valid data appears on DOUT after the access times tRAC (from RAS#) and tCAC (from CAS#); DIN stays in the Hi-Z state. The cycle repeats after the RAS# precharge time tRP, giving a total read-cycle time tRC.]

DRAM Write Cycle (Early Write)

[Figure: DRAM early-write-cycle timing. The row and column addresses are strobed by RAS# and CAS# as in a read, but WE# is asserted before CAS# falls (tWCS); DIN must meet the data setup and hold times tDS and tDH around the falling edge of CAS#, and DOUT remains in the Hi-Z state.]

RANDOM-ACCESS MEMORY (11)

• Classic DRAM access mode (RAS and CAS strobed for each bit)

. Row access time = tRAC ≈ 60 ns
. Cycle time = tRC ≈ 2 tRAC = time between successive accesses
• DRAM burst modes (typically one burst = 4 contiguous accesses)
. Fast page mode (FPM): DRAM IC has a register to hold a row
◦ RAS needs to be strobed only once
◦ CAS must be strobed for each cell accessed (see next slide)
. Extended data out (EDO) is basically a more useful FPM
◦ Data is not disabled when CAS goes high (see next slide)
◦ Valid data remains available until data from next access appears
. Pipelined burst mode
◦ Memory is interleaved using banks on the DRAM IC
◦ The DRAM IC generates addresses to be fetched
◦ Eliminates cycle time

[Figure: FPM DRAM page-mode cycle. RAS# stays low while CAS# is strobed repeatedly; successive column addresses on the same row produce valid data after tCAC each, but DQ returns to the open (high-impedance) state whenever CAS# rises.]

[Figure: EDO DRAM page-mode cycle. Same strobing as FPM, but the output is not disabled when CAS# goes high, so valid data is held on DQ until the next access's data appears.]

[Figure: physical difference between FPM and EDO — (a) in conventional FPM the DQ lines float (open) between CAS strobes; (b) in EDO the data is held valid through the next CAS cycle.]

SDRAM block diagram (2 Mbit × 4 I/O × 2 banks)

[Figure: SDRAM block diagram. Two 2048 × 1024 memory banks share twelve multiplexed address buffers, a mode register, a command decoder (CS, RAS, CAS, WE), sense amplifiers, column decoders with DQ gating, sequential-control data latches, and 4-bit data input/output buffers; CLK, CKE, and DQM buffers synchronize operation, and a row-address counter supports self-refresh.]

Pipelined burst read operation

[Figure: with CAS latency 1, a single READ command at T0 followed by NOPs delivers DOUT A0–A7 on the DQs on each of the next eight clock edges, T1–T8.]

RANDOM-ACCESS MEMORY (12)

• Because 1-transistor DRAMs store bits as charges on capacitors, the charges leak away and therefore must be refreshed periodically
• DRAM refresh cycles: The Joint Electronics Design Engineering Council (JEDEC) has approved two refresh types for 4M × N DRAMs
◦ One version of a 4M × 4 DRAM requires 10 column-address bits and 12 row-address bits
  2^12 = 4096 (4K) rows are refreshed every 64 ms
◦ The other version for the same capacity DRAM requires 11 row-address bits and 11 column-address bits
  2^11 = 2048 (2K) rows are refreshed every 32 ms
  4K refresh chips draw less current per refresh than 2K refresh chips, because the page depth (number of columns) is 2^10 = 1024 in the 4K refresh chips vs. 2^11 = 2048 in the 2K refresh chips
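Both JEDEC options refresh rows at the same average rate, as a short sketch confirms:

```python
def row_refresh_period_us(rows, interval_ms):
    """Average time budget per row refresh: interval / number of rows."""
    return interval_ms * 1000.0 / rows

# 4K-refresh part: 4096 rows every 64 ms; 2K-refresh part: 2048 rows every 32 ms
print(row_refresh_period_us(4096, 64))   # 15.625 us per row
print(row_refresh_period_us(2048, 32))   # 15.625 us per row
```

The difference between the two parts is therefore not the refresh rate but the current drawn per refresh, set by how many columns each row activation touches.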

SUPERCOMPUTER MEMORY SIZES

Computer          Date   Memory Size (MB)   Technology
CDC 6600          1964   0.75               ferrite cores
CDC 7600          1969   0.25               ferrite cores
CRAY-1            1976   4                  ECL
CRAY X-MP/48      1984   64                 ECL
CRAY Y-MP/864     1988   512                ECL
CRAY Y-MP C916    1993   8192               BiCMOS


WAYS TO MATCH CPU AND MEMORY SPEEDS

• Solutions to the problem that CPU clock period ≈ 0.05–0.1 × DRAM access time
. Interleave the memory system
◦ Permits sequential reads or writes to main memory on successive CPU clock periods
◦ Cray's solution: interleaved SRAM memory ($$$$$!)
. Hierarchical memory
◦ Keep recently-used data and instructions (and items with nearby addresses) in fast memory (cache)
◦ Fetch data and instructions from main memory if not in cache
. Other examples of the same approach:
◦ Main memory (SRAM) and Solid State Disk (DRAM) (CRAY)
◦ Main memory (faster) and magnetic disk (slower)
◦ Magnetic disk (faster) and magnetic tape (slower)

[Figure: processor versus memory performance, 1980–2000. The CPU curve changes slope due to the RISC revolution and pulls steadily away from the slowly improving memory curve. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2nd Edition.]

R4000 CPU

[Figure: logical hierarchy of R4000 memory — registers, then the on-chip primary I-cache and D-cache, then the secondary S-cache, then main memory, then peripherals (disk, CD-ROM, tape, etc.); access time increases and so does capacity at each level moving away from the CPU.]

Three techniques for achieving higher memory bandwidth: (a) one-word-wide memory organization, (b) wide memory organization, (c) interleaved memory organization

[Figure: (a) CPU, cache, bus, and memory connected by a one-word-wide path; (b) a wide memory and bus with a multiplexor between cache and CPU; (c) four interleaved memory banks (bank 0–bank 3) on a one-word-wide bus. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2nd Edition.]
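To see why the three organizations matter, here is a sketch with textbook-style assumed timings — 1 cycle to send the address, 15 cycles per DRAM access, 1 cycle per bus transfer; these numbers are illustrative, not from the slides — for fetching a 4-word block:

```python
def miss_penalty(words, addr=1, access=15, transfer=1, width=1, banks=1):
    """Cycles to fetch a block of `words` words from DRAM.
    width = words per access and per bus transfer (wide organization);
    banks = interleaved banks (accesses overlap, transfers are serialized)."""
    if banks > 1:
        # Interleaved: one access latency, then one word per cycle
        return addr + access + words * transfer
    accesses = -(-words // width)    # ceiling division
    return addr + accesses * (access + transfer)

print(miss_penalty(4))             # one-word-wide: 1 + 4*(15+1) = 65
print(miss_penalty(4, width=4))    # wide (4 words): 1 + 15 + 1  = 17
print(miss_penalty(4, banks=4))    # interleaved:    1 + 15 + 4  = 20
```

Interleaving approaches the wide organization's penalty while keeping the narrow (cheap) bus — which is exactly the trade-off the figure illustrates.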

INTERLEAVED MEMORY (1)

• Goal: Eliminate (or greatly reduce) the effects of DRAM cycle time, permitting memory accesses in successive clock periods
. Physical memory is partitioned into 2^n banks of equal size
. The address of the bank in which the byte or word at address M is stored is the least significant n bits of M:
    bank address = word address (mod 2^n)
. The number of banks should be no smaller than the number of clock periods in one memory cycle time
◦ Example: If the bank cycle time is 6 clock periods, then at least 2^n = 8 banks should be used (8-way interleaving)
. Stride = 1 + number of words between successively accessed words
. Bank conflicts (a bank accessed before it's ready) occur whenever the stride in memory has a common factor with the number of banks
◦ Example: With 8 banks, a bank cycle time of 6 clock periods, and a stride of 10, a bank conflict occurs on every 4th access

[Figure: four-way interleaved memory (assuming 4-bit addresses). Words 0, 4, 8, 12 go to bank 0, words 1, 5, 9, 13 to bank 1, words 2, 6, 10, 14 to bank 2, and words 3, 7, 11, 15 to bank 3 — the bank number is the low two address bits (dd00, dd01, dd10, dd11 in binary). From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2nd Edition.]

[Figure: pipelined read operation with 4 memory banks (B0–B3) — a READ command at T0 followed by NOPs yields DOUT B0, B1, B2, B3, B0, B1, B2, B3 on successive clocks T1–T8.]
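The bank-conflict example can be checked with a small simulation (the issue model — one access attempted per clock, each bank busy for the full cycle time — is our simplification):

```python
def conflict_indices(n_accesses, stride, banks=8, cycle=6):
    """Simulate one access attempt per clock; a conflict occurs when
    the target bank is still busy from a previous access."""
    ready = [0] * banks          # clock at which each bank is free again
    t, conflicts = 0, []
    for i in range(n_accesses):
        b = (i * stride) % banks
        if ready[b] > t:         # bank accessed before it's ready
            conflicts.append(i)
            t = ready[b]         # stall until the bank recovers
        ready[b] = t + cycle
        t += 1
    return conflicts

# 8 banks, 6-clock bank cycle time, stride 10: conflict on every 4th access,
# because gcd(10, 8) = 2 means only 4 of the 8 banks are ever used.
print(conflict_indices(16, 10))   # [4, 8, 12]
```

With a stride of 1 (or any stride coprime to 8) the same simulation reports no conflicts at all.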

INTERLEAVED MEMORY (2)

• Memory organization of a CRAY J90 or C90 processor
. 64 memory banks
◦ Consecutive words in address space stored in consecutive banks
◦ Bank cycle time: 6 clock periods
. 3 vector ports to memory in J90, 4 in C90
. In every clock period, two operands can be loaded from memory and a result can be stored to memory


HIERARCHICAL MEMORY (1)

• Goal: Create the illusion that the processor has access to a huge amount of very fast memory
. Must be an illusion because price ∝ speed:

  Technology       Access Time (ns)   $ per MByte
  SRAM             5–15               80–200
  DRAM             60–70              8–20
  Magnetic Disk    10^7               0.18–0.40

. Strategy:
◦ Keep most frequently used data & instructions in fastest memory
◦ Put less frequently used items in next lower level of memory
◦ Use locality in space and time as initial criteria for deciding what to put in fast memory


HIERARCHICAL MEMORY (2)

• Memory references in running processes or threads tend to obey the Principles of Locality:
. If a data item or an instruction is accessed, it is likely to be used again soon (temporal locality)
. If a data item or an instruction is referenced, a data item or an instruction that is stored at a nearby address in main memory is likely to be accessed soon (spatial locality)
• Keep most frequently used data & instructions in fastest memory
. Put less frequently used items in next lower level of memory
. Use locality in space and time as initial criteria for deciding what to put in fast memory
. Locality generally is not the same for data and instructions
. Example: matrix multiplication with a very large stride in memory
◦ Instructions are highly local (inner loop), but data is not local

[Figure: processor, cache, and main memory, with block data transfers between cache and main memory]

HIERARCHICAL MEMORY (3)

Terminology
• A hit occurs in an upper level of hierarchical memory when the data requested is in a block in the upper level
 Hit rate = fraction of accesses found in upper level
 Hit time = time to access upper level + hit/miss decision time
• A miss occurs in an upper level of hierarchical memory when the data requested is not in a block in the upper level
 Data must be obtained from a block in the lower level
 Instruction processing stops until the data is fetched
 Then the processor restarts the access
 Miss rate = 1 − hit rate
 Miss penalty = time to get block from lower level + time to place block in upper level + time to deliver data to the processor
 Hit time ≪ miss penalty

HIERARCHICAL MEMORY (4)

The levels of hierarchical memory between the processor and main memory are called cache
• Level 1 cache is on the same die as the processor
 Size is limited by available real estate (8–64 kB)
 May contain:
  ◦ Instructions only (common in processors intended for use in embedded systems)
  ◦ Data only (not common, because instructions have better locality than data)
  ◦ Instructions and data in same cache
  ◦ Instructions and data in separate caches
• Level 2 cache is implemented in SRAM (usually on a separate card or chip)
 Used with all RISC processors
 Size usually 1 MB per processor

Levels in a typical memory hierarchy:

            Registers   L1 cache   L2 cache   Main memory   Disk
 Size:      200 B       64 KB      1 MB       128 MB        12 GB
 Speed:     2 ns        4 ns       8 ns       60 ns         5 ms

[Figure: CPU with registers, L1 and L2 caches on the CPU bus, main memory, I/O bus, I/O devices, and disk memory]

[Block diagram: instruction unit, branch processing unit, dispatch unit, reservation stations, integer and floating-point units, load/store unit, completion unit, instruction and data MMUs, 32-Kbyte instruction and data caches, and 60x/L2 bus interface units]

PPC750 Microprocessor Block Diagram

HIERARCHICAL MEMORY (5)

Implementations of L2 cache
• On the same system bus as the processor and main memory
 L2 cache is accessed at the same frequency as main memory
 L2 cache SRAMs are on a card that plugs into a special slot on the logic board
• On a separate cache bus
 Requires a separate bus interface on the processor
 L2 cache is accessed at the processor clock frequency, or (especially in portables) at 1/2 the processor clock frequency
 L2 cache SRAMs are on the same card (or in the same module) as the processor
 L2 cache on the side opposite to the processor is called backside cache
 L2 cache on the same side as the processor is called inline cache

[Figure: MPC750 connected to two 128k × 36 synchronous SRAMs over a 17-bit L2 address bus and a 64-bit L2 data bus]

Typical PPC 750 1-Mbyte L2 Cache Configuration

[Figure: P6 bus interface unit between the L2 cache/system bus and the L1 instruction and data caches, with fetch/decode, dispatch/execute, and retire units sharing an instruction pool]

P6 processor interface to memory

HIERARCHICAL MEMORY (6)

Parameters that the cache designer can vary to optimize the design for a particular objective:
• Cache size (the amount of data that the cache can hold)
• Cache line size (block size)
 The line size is the amount of data loaded from the next lower level in the event of a cache miss (requested data not in cache)
• Cache associativity (see later slides)
• Cache write strategy
 Does newly computed data get written back to main memory on every write (write-through) or only when the cache line containing the data is invalidated (write-back)?
 On a write miss, does the cache line containing the requested data get read into the cache (write allocate) or written only to memory (write around)?


HIERARCHICAL MEMORY (7)

Systems approach to data transfers between two levels of hierarchical memory (Amdahl's law again!):
• Logically organize all data stored in the lower level into blocks
• Store the most-frequently-used blocks in the upper level
 Hit rate H = fraction of memory references to data in the upper level
 Miss rate = 1 − H; upper-level latency = T_u; miss penalty = T_m; lower-level latency = T_l
 Average memory latency:  T_e = H T_u + (1 − H) T_m
 Speedup:  S = T_l / T_e = (T_l / T_m) / (1 − H (1 − T_u / T_m))

CACHE MISS RATE vs. LINE SIZE

[Plot: miss rate (0–40%) vs. block size (4–256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

HIERARCHICAL MEMORY (8)

Direct-mapped cache organization:
• Main memory has 2^m bytes, cache has 2^c bytes (c < m)
• Each main-memory block can be placed in exactly one cache line:
 line index = (block address) mod (number of cache lines)

upper address bits (tag): 32 − 5 = 27 bits | index: 3 bits | byte offset: 2 bits

[Figure: an eight-line direct-mapped cache, (a) before any reference to X_n and (b) after the first reference to X_n; the reference to X_n fills the empty line at index 110]

a. Before any reference to Xn   b. After the first reference to Xn

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

DIRECT-MAPPED CACHE WITH EIGHT 4-BYTE LINES (2)

[Figure: mapping of main-memory word addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101 onto the lines of an eight-line direct-mapped cache]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

DIRECT-MAPPED CACHE WITH 1024 LINES

[Figure: a direct-mapped cache with 1024 lines; a 32-bit address is split into a 20-bit tag, a 10-bit index, and a 2-bit byte offset; the valid bit and tag of the indexed entry are checked to generate a hit]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

DECSTATION 3100 CACHE WITH 16384 LINES

[Figure: the DECstation 3100 cache with 16K entries; a 32-bit address is split into a 16-bit tag, a 14-bit index, and a 2-bit byte offset; each entry holds a valid bit, a 16-bit tag, and 32 bits of data]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

64-KB CACHE WITH 16-BYTE LINES

[Figure: a 64-KB cache with 16-byte lines and 4K entries; a 32-bit address is split into a 16-bit tag, a 12-bit index, a 2-bit block offset, and a byte offset; a multiplexor selects one of the four 32-bit words in the 128-bit line]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

HIERARCHICAL MEMORY (9)

Fully-associative cache organization:
• Main memory has 2^m bytes, cache has 2^c bytes (c < m)
• A block can be placed in any cache line, so every line's tag must be searched on each access


HIERARCHICAL MEMORY (10)

Set-associative cache organization:
• Main memory has 2^m bytes
• A 2^s-way set-associative cache has 2^c bytes (c < m)
 Each set holds 2^s lines; a block maps to exactly one set:
  set index = (block address) mod (number of sets)

[Figure: cache organizations from direct mapped to fully associative; in a four-way set-associative cache (s = 2) each set holds four tag/data pairs, and an eight-line fully associative cache is eight-way set associative]

PLACEMENT OF MEMORY BLOCK 12

[Figure: placement of memory block 12 in an eight-block cache. Direct mapped: block # (12 mod 8) = 4. 2-way set associative: set # (12 mod 4) = 0. Fully associative: block 12 can go anywhere. For the 30-bit block address 0...01100: direct mapped index = 100 (bits 4–2 of the byte address), tag = bits 31–5; 2-way set associative index = 00 (bits 3–2), tag = bits 31–4]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

4-WAY SET-ASSOCIATIVE CACHE

[Figure: a 4-way set-associative cache with 256 sets; a 32-bit address is split into a 22-bit tag, an 8-bit index, and a byte offset; four tags are compared in parallel and a 4-to-1 multiplexor selects the data on a hit]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

[Figure: the organization of the data cache in the Alpha AXP 21064 microprocessor: 256 direct-mapped 32-byte blocks with a 21-bit tag, 8-bit index, and 5-bit block offset, plus a write buffer to the lower-level memory]

Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2nd Edition

STRATEGIES FOR CACHE WRITES

Write-back
• When a write occurs, data is written only to the block in the cache
 The block in the lower level of memory is replaced only when the corresponding block in the cache is replaced by another block
 The processor can write at cache speed instead of memory speed
 High-speed block transfers can be used effectively
Write-through
• When a write occurs, data is written both to the block in the cache and to the block in the lower level of memory
• Must use a write buffer to compensate for the CPU/memory speed difference
• Minimizes the cache miss penalty because the block being replaced has already been written to memory
• Impractical in virtual memory systems (enormous latency of a write to a block that is on disk)

THREE C’s MODEL

Goal: Understand the sources of cache misses
In this model, cache misses are classified into one of the following categories:
• Compulsory (cold-start) misses
 The referenced block was never in the cache
• Capacity misses: The cache cannot contain all of the blocks needed by a process
 Signaled by blocks being replaced, then retrieved again
 For an example, see slides on the computational effects of stride in memory
• Conflict misses: Cache misses that occur in direct-mapped or set-associative caches because more than one block needs to be stored in a given set

[Plot: miss rate per type (0–14%) vs. cache size (1–128 KB), showing compulsory, capacity, and conflict components; the conflict miss rate increases as associativity falls from eight-way to one-way]

EFFECTS OF STRIDE ON MEMORY ACCESSES (1)

Stride = 1 + number of words between successively accessed words
• Usually used to refer to array accesses
• If stride = 1, then words are accessed successively (unit stride) (since the number of intervening words = 0)
• Hierarchical memory: If stride > cache line size (in words), then there is no data locality in a practical sense
 For large strides, each array access usually incurs the full latency of the first word accessed
 Small benchmark programs give unrealistically high performance estimates
• Interleaved memory: If the stride is not relatively prime to the number of memory banks, there will be problem sizes at which performance is sharply lowered

EFFECTS OF STRIDE ON MEMORY ACCESSES (2)

• Multiplication of n × n matrices: C = AB
 Component form: c_ik = Σ_{j=1}^n a_ij b_jk
 Different loop orderings in matrix multiplication correspond to different points of view about the organization of a matrix
 Dot-product interpretation: c_ik = matrix product of row i of A with column k of B
 Dyadic interpretation: C = sum of dyadics of the form (matrix product of column j of A with row j of B)
 Gaxpy (αx + y) interpretation: C = sum of dyadics, each of which contains a linear combination of columns of A or rows of B
• For computational purposes, there are 3 loops (on i, j, and k)
 Dot-product interpretation applies when the loop on j is innermost

 When the i loop is innermost, it computes b_jk × (column j of A)

 When the k loop is innermost, it computes a_ij × (row j of B)

MATRIX MULTIPLICATION

• Equations for multiplication of n × n matrices:

  C = \sum_{i=1}^{n} \sum_{k=1}^{n} c^i_k \, e_i e^k = AB
    = \Bigl( \sum_{i=1}^{n} \sum_{j=1}^{n} a^i_j \, e_i e^j \Bigr)
      \Bigl( \sum_{l=1}^{n} \sum_{k=1}^{n} b^l_k \, e_l e^k \Bigr)
    = \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} a^i_j b^j_k \, e_i e^k

    = \sum_{i=1}^{n} \sum_{k=1}^{n} e_i
      \Bigl( \underbrace{\sum_{j=1}^{n} a^i_j e^j}_{\text{row } i} \Bigr)
      \Bigl( \underbrace{\sum_{l=1}^{n} b^l_k e_l}_{\text{col } k} \Bigr) e^k    dot product form (ikj and kij)

    = \sum_{k=1}^{n} \Bigl[ \sum_{j=1}^{n} b^j_k
      \Bigl( \underbrace{\sum_{i=1}^{n} a^i_j e_i}_{\text{col } j} \Bigr) \Bigr] e^k    gaxpy-dyadic form (kji)

    = \sum_{j=1}^{n}
      \Bigl( \underbrace{\sum_{i=1}^{n} a^i_j e_i}_{\text{col } j} \Bigr)
      \Bigl( \underbrace{\sum_{k=1}^{n} b^j_k e^k}_{\text{row } j} \Bigr)    dyadic form (jki and jik)

    = \sum_{i=1}^{n} e_i \sum_{j=1}^{n} a^i_j
      \Bigl( \underbrace{\sum_{k=1}^{n} b^j_k e^k}_{\text{row } j} \Bigr)    gaxpy-dyadic form (ijk)

EFFECTS OF STRIDE ON MEMORY ACCESSES (3)

• The operations in the computation of a matrix product C = AB are

  /* (ijk) loop */
  for (i=0; i<n; i++)
    for (j=0; j<n; j++)
      for (k=0; k<n; k++)
        c[i][k] = c[i][k] + a[i][j] * b[j][k];

/* matmul.c */
/* matrix multiplication benchmarking program */

#include <stdio.h>

#define MSIZE 300

void matgen(double a[], int lda, int n, double b[], double *norma);
double dtime(void);

int main(void)
{
   int i, j, k;
   int n = MSIZE;
   int n3;
   /* lda must be set to MSIZE + 1 */
   static int lda = MSIZE + 1;
   double norma, normb;
   /* rows are padded to lda doubles so that matgen's flat
      a[lda*j+i] indexing stays within the arrays */
   static double a[MSIZE][MSIZE+1];
   static double b[MSIZE][MSIZE+1];
   static double c[MSIZE][MSIZE+1];
   static double v[MSIZE];
   double time[6];
   double speed[6];
   double starttime;
   static char *order[6] = {"ijk","ikj","jki","kij","kji","jik"};

   /* Generate matrices of pseudorandom numbers */
   matgen((double *)a, lda, n, v, &norma);
   matgen((double *)b, lda, n, v, &normb);

   /* zero the matrix product */
   for (i=0; i<n; i++)
      for (k=0; k<n; k++)
         c[i][k] = 0.0;
   /* start the clock for loop 0 (ijk) */
   starttime = dtime();
   /* (ijk) loop */
   for (i=0; i<n; i++)
      for (j=0; j<n; j++)
         for (k=0; k<n; k++)
            c[i][k] = c[i][k] + a[i][j] * b[j][k];
   /* calculate the time for loop 0 */
   time[0] = dtime() - starttime;

   /* zero the matrix product */
   for (i=0; i<n; i++)
      for (k=0; k<n; k++)
         c[i][k] = 0.0;
   /* start the clock for loop 1 (ikj) */
   starttime = dtime();
   /* (ikj) loop */
   for (i=0; i<n; i++)
      for (k=0; k<n; k++)
         for (j=0; j<n; j++)
            c[i][k] = c[i][k] + a[i][j] * b[j][k];
   /* calculate the time for loop 1 (ikj) */
   time[1] = dtime() - starttime;

   /* zero the matrix product */
   for (i=0; i<n; i++)
      for (k=0; k<n; k++)
         c[i][k] = 0.0;
   /* start the clock for loop 2 (jki) */
   starttime = dtime();
   /* (jki) loop */
   for (j=0; j<n; j++)
      for (k=0; k<n; k++)
         for (i=0; i<n; i++)
            c[i][k] = c[i][k] + a[i][j] * b[j][k];
   /* calculate the time for loop 2 (jki) */
   time[2] = dtime() - starttime;

   /* zero the matrix product */
   for (i=0; i<n; i++)
      for (k=0; k<n; k++)
         c[i][k] = 0.0;
   /* start the clock for loop 3 (kij) */
   starttime = dtime();
   /* (kij) loop */
   for (k=0; k<n; k++)
      for (i=0; i<n; i++)
         for (j=0; j<n; j++)
            c[i][k] = c[i][k] + a[i][j] * b[j][k];
   /* calculate the time for loop 3 (kij) */
   time[3] = dtime() - starttime;

   /* zero the matrix product */
   for (i=0; i<n; i++)
      for (k=0; k<n; k++)
         c[i][k] = 0.0;
   /* start the clock for loop 4 (kji) */
   starttime = dtime();
   /* (kji) loop */
   for (k=0; k<n; k++)
      for (j=0; j<n; j++)
         for (i=0; i<n; i++)
            c[i][k] = c[i][k] + a[i][j] * b[j][k];
   /* calculate the time for loop 4 (kji) */
   time[4] = dtime() - starttime;

   /* zero the matrix product */
   for (i=0; i<n; i++)
      for (k=0; k<n; k++)
         c[i][k] = 0.0;
   /* start the clock for loop 5 (jik) */
   starttime = dtime();
   /* (jik) loop */
   for (j=0; j<n; j++)
      for (i=0; i<n; i++)
         for (k=0; k<n; k++)
            c[i][k] = c[i][k] + a[i][j] * b[j][k];
   /* calculate the time for loop 5 (jik) */
   time[5] = dtime() - starttime;

   /* compute the speed in MFLOPS */
   n3 = n*n*n;
   for (i=0; i<6; i++) {
      speed[i] = ((1.0e-6)*n3)/time[i];
   }

   printf("\n\n");
   printf("loop order speed (MFLOPS) time (seconds)\n");
   for (i=0; i<6; i++) {
      printf("%s %e %e \n", order[i], speed[i], time[i]);
   }
   return 0;
}

/*------------------------------------------------------*/
void matgen(double a[], int lda, int n, double b[], double *norma)
/* We would like to declare a[][lda], but C does not allow it.  In this
   function, references to element (j,i) are written a[lda*j+i]. */
{
   int init, i, j;

   init = 1325;
   *norma = 0.0;
   for (j = 0; j < n; j++) {
      for (i = 0; i < n; i++) {
         init = 3125*init % 65536;
         a[lda*j+i] = (init - 32768.0)/16384.0;
         *norma = (a[lda*j+i] > *norma) ? a[lda*j+i] : *norma;
      }
   }
   for (i = 0; i < n; i++) {
      b[i] = 0.0;
   }
   for (j = 0; j < n; j++) {
      for (i = 0; i < n; i++) {
         b[i] = b[i] + a[lda*j+i];
      }
   }
}

/*------------------------------------------------------*/

/* Pick which time function */

/************************************************/
/* Sun Solaris POSIX dtime() routine            */
/* Provided by: Case Larsen, CTLarsen.lbl.gov   */
/************************************************/
/*
#include <sys/time.h>
#include <sys/resource.h>
#include <unistd.h>

struct rusage rusage;

double dtime(void)
{
   double q;

   getrusage(RUSAGE_SELF, &rusage);

   q = (double)(rusage.ru_utime.tv_sec);
   q = q + (double)(rusage.ru_utime.tv_usec) * 1.0e-06;

   return q;
}
*/
/*****************************************************/
/* UNIX dtime().  This is the preferred UNIX timer.  */
/* Provided by: Markku Kolkka, [email protected]       */
/* HP-UX addition by: Bo Thide', [email protected]     */
/*****************************************************/
#include <sys/time.h>
#include <sys/resource.h>

struct rusage rusage;

double dtime(void)
{
   double q;

   getrusage(RUSAGE_SELF, &rusage);

   q = (double)(rusage.ru_utime.tv_sec);
   q = q + (double)(rusage.ru_utime.tv_usec) * 1.0e-06;

   return q;
}

EFFECTS OF STRIDE ON MEMORY ACCESSES (4)

FORTRAN loop orderings jki and kji: In the innermost (i) loop in
 c(i,k) = c(i,k) + a(i,j) * b(j,k),
the matrix elements c(i,k) and a(i,j) are accessed successively by rows, which implies stride 1 given the FORTRAN storage order:

ar2(1,1)  ar2(1,2)  ar2(1,3)
ar2(2,1)  ar2(2,2)  ar2(2,3)
ar2(3,1)  ar2(3,2)  ar2(3,3)
ar2(4,1)  ar2(4,2)  ar2(4,3)
(stored in memory column by column: ar2(1,1), ar2(2,1), ar2(3,1), ar2(4,1), ar2(1,2), ...)


EFFECTS OF STRIDE ON MEMORY ACCESSES (5)

FORTRAN loop orderings ijk and jik: In the innermost (k) loop in
 c(i,k) = c(i,k) + a(i,j) * b(j,k),
c(i,k) and b(j,k) are accessed successively by columns, which implies stride n given the FORTRAN storage order:

ar2(1,1)  ar2(1,2)  ar2(1,3)
ar2(2,1)  ar2(2,2)  ar2(2,3)
ar2(3,1)  ar2(3,2)  ar2(3,3)
ar2(4,1)  ar2(4,2)  ar2(4,3)
(stored in memory column by column: ar2(1,1), ar2(2,1), ar2(3,1), ar2(4,1), ar2(1,2), ...)

The following plot was made on an Ultra 10 workstation with 512 kB L2 cache
• Cache line size = 16 B

Speed vs. FORTRAN loop order

[3-D plot: speed (MFLOPS, 0–4.5) vs. matrix dimension (100–600) and loop order (ijk, ikj, jki, kij, kji, jik); the cache size is marked on the matrix-dimension axis]

EFFECTS OF STRIDE ON MEMORY ACCESSES (6)

C loop orderings ijk and jik: In the innermost (k) loop in
 c[i][k] = c[i][k] + a[i][j] * b[j][k],
the matrix elements c[i][k] and b[j][k] are accessed successively by columns, which implies unit stride given the C storage order:

ar2[0][0]  ar2[0][1]  ar2[0][2]
ar2[1][0]  ar2[1][1]  ar2[1][2]
ar2[2][0]  ar2[2][1]  ar2[2][2]
ar2[3][0]  ar2[3][1]  ar2[3][2]
(stored in memory row by row: ar2[0][0], ar2[0][1], ar2[0][2], ar2[1][0], ...)

EFFECTS OF STRIDE ON MEMORY ACCESSES (7)

C loop orderings jki and kji: In the innermost (i) loop in
 c[i][k] = c[i][k] + a[i][j] * b[j][k],
the matrix elements c[i][k] and a[i][j] are accessed successively by rows, which implies stride n given the C storage order:

ar2[0][0]  ar2[0][1]  ar2[0][2]
ar2[1][0]  ar2[1][1]  ar2[1][2]
ar2[2][0]  ar2[2][1]  ar2[2][2]
ar2[3][0]  ar2[3][1]  ar2[3][2]
(stored in memory row by row: ar2[0][0], ar2[0][1], ar2[0][2], ar2[1][0], ...)

Speed vs. C loop order (static arrays)

[3-D plot: speed (MFLOPS, 0–4) vs. matrix dimension (100–600) and loop order (ijk, ikj, jki, kij, kji, jik); the cache size is marked on the matrix-dimension axis]

Speed vs. C loop order (pointers)

[3-D plot: speed (MFLOPS, 0–4) vs. matrix dimension (100–600) and loop order (ijk, ikj, jki, kij, kji, jik), for the pointer-based version of the benchmark]

WHO MANAGES HIERARCHICAL MEMORY?

The user
• No virtual memory
• Program overlays may be needed
• The user must make explicit transfers of data blocks among levels
• Context switching is accomplished by swapping entire processes
• Requires very fast backup of memory (e.g., Cray's SSD)
• Optimized for single-user mode
The operating system
• Virtual memory
 In most OSs, VM is optimized for rotating magnetic disks
 Solid state disks may work best with other algorithms
• Context switching is performed by the kernel
• Optimized for multi-user mode


VIRTUAL MEMORY

Useful in a multiprogramming environment
• Each process has its own logical address space
• Logical-to-physical address translation on every memory reference
• The operating system protects each process's physical address space
Strategy: Main memory functions as a "cache" for magnetic disk
• Keep most recently used data & instructions in main memory
Size of "cached" blocks:
• All microprocessors permit fixed-size blocks (pages)
 Fixed size & relocation eliminate memory fragmentation
 A miss is a page fault (penalty ≈ 10^5 × cache miss penalty)
• 80x86 architecture also permits variable-size blocks (segments)
 Address translation and memory fragmentation are problems

[Log-log plot: cost ($/MB) vs. access time (ns) for SRAM (chip), DRAM (chip and board), and magnetic disk in 1980, 1985, 1990, and 1995, showing the access time gap between DRAM and disk]

Cost versus access time for SRAM, DRAM, and magnetic disk in 1980, 1985, 1990, and 1995
Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2nd Edition

Typical ranges of parameters for caches and virtual memory

Parameter            First-level cache     Virtual memory
Block (page) size    16–128 bytes          4096–65,536 bytes
Hit time             1–2 clock cycles      40–100 clock cycles
Miss penalty         8–100 clock cycles    700,000–6,000,000 clock cycles
 (Access time)       (6–60 clock cycles)   (500,000–4,000,000 clock cycles)
 (Transfer time)     (2–40 clock cycles)   (200,000–2,000,000 clock cycles)
Miss rate            0.5–10%               0.00001–0.001%
Data memory size     0.016–1 MB            16–8192 MB

Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2nd Edition

Paging versus segmentation

                         Page                                      Segment
Words per address        One                                       Two (segment and offset)
Programmer visible?      Invisible to application programmer       May be visible to application programmer
Replacing a block        Trivial (all blocks are the same size)    Hard (must find contiguous, variable-size, unused portion of main memory)
Memory use inefficiency  Internal fragmentation (unused portion    External fragmentation (unused pieces of
                         of page)                                  main memory)
Efficient disk traffic   Yes (adjust page size to balance access   Not always (small segments may transfer just
                         time and transfer time)                   a few bytes)

Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2nd Edition

[Figure: how paging and segmentation divide a program's code and data]

Comparison showing how paging and segmentation divide a program's instructions and data
Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2nd Edition

[Figure: Pentium segment descriptors. A code segment and a data segment each hold 8 attribute bits (present, DPL, conforming/expand-down, readable/writable, accessed), G and D bits, a 32-bit base, and a 24-bit limit; a call gate holds attribute bits, a word count, a 16-bit destination selector, and a 16-bit destination offset]

FIGURE 5.45 The Pentium segment descriptors are distinguished by bits in the attributes field.
Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2nd Edition

[Figure: virtual pages A, B, C, D of a contiguous virtual address space mapped into physical main memory (C at 8K, A at 16K, B at 24K) or onto disk (D)]

A logical program in its contiguous virtual address space is shown on the left; it consists of four pages A, B, C, and D
Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2nd Edition

ADDRESS TRANSLATION (1)

The logical address (known to the running process) is partitioned into a virtual page number and a page offset
• Virtual page number must be translated to a physical page number
• Mapping of virtual to physical page numbers is fully associative (huge page fault penalty ⇒ need lowest possible miss rate)
 Sophisticated algorithms are used
 Would one use different algorithms for a solid-state disk than for a rotating magnetic disk?
Page table (in main memory because of size)
• An entry is indexed with the virtual page number
• Each entry contains the physical page number and a valid bit
• Page table register (in processor) points to start of page table
• Valid bit = 0 means page not in memory (page fault)

[Figure: a virtual address (virtual page number + page offset) is translated through the page table into a physical address in main memory]

Mapping a virtual address to a physical address via a page table
Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2nd Edition

SYSTEM CALL IMPLEMENTATION OF PAGE MAPPING

The system call
 pa = mmap(addr, len, prot, flags, fildes, off);
returns an address pa in the process's address space at which the memory object represented by the file descriptor fildes is mapped
• The address ranges are [pa, pa + len) and [off, off + len)
• These ranges must be legitimate for the possible (not necessarily current) address space of a process and the memory object
• A reference to an address beyond the end of the memory object results in a SIGBUS or SIGSEGV signal
The virtual → physical mapping may be many-to-one
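A minimal use of mmap on a Unix system follows: the function maps a file read-only and returns its first byte. The helper name and the minimal error handling are illustrative, not part of any standard interface.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the file named by path read-only and return its first byte,
   or -1 on failure.  pa is the address at which the object is mapped. */
int first_byte(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return -1; }
    unsigned char *pa = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                          /* the mapping survives the close */
    if (pa == MAP_FAILED) return -1;
    int b = pa[0];                      /* the file is read through the mapping */
    munmap(pa, st.st_size);             /* release the mapping */
    return b;
}
```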

VIRTUAL-TO-PHYSICAL ADDRESS MAPPING

[Figure: translation of a virtual address (virtual page number, bits 31–12, + page offset, bits 11–0) into a physical address (physical page number, bits 29–12, + page offset)]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

INDEXING THE PAGE TABLE

[Figure: the page table register points to the page table; the 20-bit virtual page number indexes the table, whose entries hold a valid bit and an 18-bit physical page number; if the valid bit is 0, the page is not present in memory]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

THE PAGE TABLE IMPLEMENTS THE MAPPING

[Figure: each page table entry holds a valid bit and either a physical page number (valid = 1) or the page's disk address (valid = 0)]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

[Figure: an Alpha virtual address (seg0/seg1 selector + Level1 + Level2 + Level3 fields + page offset) is translated through three levels of page tables, starting from the page table base register; the resulting physical page-frame number is concatenated with the page offset]

The mapping of an Alpha virtual address

Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2nd Edition

The organization of seg0 and seg1 in the Alpha

Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2nd Edition

ADDRESS TRANSLATION (2)

Page table in memory ⇒ 2 memory accesses for every reference!
Solution: the translation lookaside buffer (TLB)
• A "cache" (on the processor die) for translated addresses
• Much faster access than page table (in main memory)
• Accessed on every memory reference
• Fully associative (small size ⇒ short search time)
• Contents:
 Physical page number
 Valid and dirty bits
• On a TLB miss:
 Consult page table to locate referenced page on disk
 Write back the old page (if page is "dirty")
The TLB is implemented on the processor die (in modern designs)
• PPC 750 has separate 128-entry data and instruction TLBs

THE TLB AS A CACHE FOR THE PAGE TABLE

[Figure: the TLB holds valid bits, tags (virtual page numbers), and physical page addresses; on a TLB miss the full page table, which holds a physical page number or a disk address for every virtual page, is consulted]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

DECSTATION 3100 TLB AND CACHE

[Figure: in the DECstation 3100, the 20-bit virtual page number is translated by the TLB (valid, dirty, tag, physical page number); the resulting physical address (physical page number + 12-bit page offset) is then split into a 16-bit cache tag, a 14-bit cache index, and a 2-bit byte offset for the cache access]

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

PROTECTION WITH VIRTUAL MEMORY

Implemented by both the hardware and the operating system
• Write-access bit in TLB can protect a page from being written
To support protection, the ISA must provide processor privilege levels (user, kernel, ...)
• Privilege level is selected by setting mode bits
• User-level process cannot change all of CPU state
• Kernel-level process can change mode bits and page table register
 If pages are in kernel address space, kernel can change them but user processes can't
Mechanisms to change privilege level:
• User → kernel: System call
• Kernel → user: RFE (return from exception)
Protection mechanisms during context switch:
• Flush TLB entries for exiting process
• Augment tag bits with process number

HANDLING PAGE FAULTS

A page fault causes an exception
• Fault must be detected and asserted in the same clock period as the memory access that caused the fault
• PC and registers are saved as of this clock period because exception processing starts with the next instruction
• Memory write disabled by deasserting the write enable signal
• When the kernel recognizes from the Cause register that a page fault has occurred, it saves the state of the active process and:
 Looks up the page table entry to find the location on disk
 Chooses a physical page to replace
  ◦ A "dirty" page must be written to disk before it's replaced
 Initiates a disk read operation to bring in the referenced page
• The kernel chooses another process to run while the page is fetched
• The MIPS ISA makes restarting the original process easy
 Memory is accessed only towards the end of execution

HANDLING TLB MISSES

Causes of a TLB miss:
• The page is present in physical memory, but lacks a TLB entry
 A new TLB entry is created, replacing an old one
 Can be handled either in hardware or in software, because few operations are required
• The page is not present in physical memory
 A page fault occurs

c C. D. Cantrell (05/1999) Page-frame Page CPU Data page-frame Page address <30> offset<13> address <30> offset<13> PC Instruction <64>Data Out <64> Data in <64>

1 2 5 18 19 23 <1><2><2> <30> <21> <1><2><2> <30> <21> V R W Tag Physical address V R W Tag Physical address I D T <64> T <64> L 1 <64> L 18 B B

21 3 (High-order 21 bits of 12:1 Mux (High-order 21 bits of 32:1 Mux physical address) <21> physical address) <21> 2 5 25 19 23

<8> <5> <8> <5> Index Block Index Block Delayed write buffer I offset D offset C C 26 12 A (256 ValidTag Data A (256 ValidTag Data C blocks) <1> <21> <64> 9 12 C blocks) <1> <21> <64> H H E 2 E 19

23 5

4 22 =? =?

24 13 <29> Instruction prefetch stream buffer <29> 27 Tag <29> Data <256> =? Tag <29> Data <256> 7 =? 9 8 Write buffer 6 28 12 4:1 Mux Alpha AXP 21064

V D Tag Data 28 <13> <16> <1><1> <13> <256> 16 Tag Index L2 C 10 Main A 14 memory C (65,536 H blocks) E 20 17 15 11 12 Magnetic =? Victim buffer 17 disk

Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2nd Edition DECSTATION 3100 TLB/CACHE FLOWCHART

Virtual address
→ TLB access
→ TLB hit? No: TLB miss exception. Yes: physical address
→ Write?
 No: try to read data from cache → Cache hit? No: cache miss stall. Yes: deliver data to the CPU
 Yes: Write access bit on? No: write protection exception. Yes: write data into cache, update the tag, and put the data and the address into the write buffer

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

INPUT/OUTPUT (I/O) SUBSYSTEMS

Overview of I/O performance measurement and analysis Processor interface issues Buses Types and characteristics of I/O devices . Hard disk storage . Network interfaces I/O system design

c C. D. Cantrell (05/1999) THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

MOTIVATION FOR STUDYING I/O

• CPU performance improves by 50% to 100% per year
• I/O systems' performance improvements are limited by physics (in some cases)
 Mechanical delays (disk drives): latency improvement is of order 5% per year
 Electrical and optical phenomena (dispersion, attenuation, crosstalk): improvement is 5% to 25% per year
• Amdahl's law implies that, sooner or later, most of the latency will be due to the part that is hardest to improve
 Given: 10% of instructions perform I/O, and the CPU becomes 10× faster
 Overall improvement is only about 5× ⇒ half of the CPU improvement is lost
• An I/O bottleneck lowers the value of CPU improvements
 As technology evolves, a diminishing fraction of total latency is due to the CPU
c C. D. Cantrell (05/1999)
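The arithmetic in that example follows from Amdahl's law; a one-line sketch (the function name is ours):

```c
/* Amdahl's law: overall speedup when a fraction f of the original
 * execution time is sped up by a factor s; the rest is unchanged. */
double amdahl_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}
```

With 10% of the time spent in (unimproved) I/O and the remaining 90% made 10× faster, f = 0.9 and s = 10 give about 5.3× — roughly half of the 10× CPU gain is lost to the I/O bottleneck.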

I/O PERFORMANCE METRICS

• Bandwidth (bits or bytes per second):
 Peak
 Sustained
 Useful for buses and networks
• Throughput (I/O processes per second)
 Useful for file serving and transaction processing
• Latency = total time for an I/O process from start to finish
 Most important to users
 Latency too great ⇒ user loses train of thought
 Latency = controller time + wait time + (no. of bytes)/bandwidth + CPU time − overlap

c C. D. Cantrell (02/1999) THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

PROCESSOR INTERFACE ISSUES

Interconnections . Buses Processor interface . Interrupts . Memory-mapped I/O I/O control structures . Polling . Interrupts . DMA . I/O controllers . I/O processors Capacity, access time, bandwidth, cost

c C. D. Cantrell (05/1999) THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

OVERVIEW OF UNIX FILE SYSTEM (1)

• A file is a byte sequence
 The operating system does not understand the file contents
 No record boundaries, record blocking, indexed files or typed files
 Resembles the model of memory
• System calls for file access:
 open returns a file descriptor
 read and write accomplish data transfer
 seek provides random access
◦ One argument is the offset (absolute or relative)
• Information about a file is stored in its inode
 The structure of an inode is defined in /usr/src/sys/sys/stat.h
 inode ≠ directory entry (more than one entry can point to an inode)
 A file's inode can be read with the stat system call
c C. D. Cantrell (03/1999)
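The byte-sequence model and the open/seek/read calls combine into a small random-access helper (a sketch; the name read_at is ours):

```c
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/* Random access via the byte-sequence model: read up to `len` bytes
 * starting at absolute byte offset `off`.  Returns the number of
 * bytes read, or -1 on any error. */
long read_at(const char *path, long off, void *buf, unsigned long len) {
    int fd = open(path, O_RDONLY);          /* open returns a file descriptor */
    if (fd < 0) return -1;
    if (lseek(fd, (off_t)off, SEEK_SET) < 0) {   /* absolute offset */
        close(fd);
        return -1;
    }
    long n = (long)read(fd, buf, len);      /* transfer raw bytes */
    close(fd);
    return n;
}
```

Because the kernel sees only bytes, the same call works on any file; the interpretation of the bytes is entirely up to the program.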

OVERVIEW OF UNIX FILE SYSTEM (2)

• Hierarchical name space Directory tree, root at / May span multiple filesystems ◦ Each filesystem exists on one physical partition ◦ Each filesystem has its own directory tree ◦ One filesystem on one partition is mapped to the name “/” This is the root filesystem • File system mounting: Means attaching the hierarchy of one filesystem at a mount point in the root filesystem Uses mount system call A mount point must exist (or must be created) as a name in the directory tree Makes physical boundaries invisible Still not possible in Windows... c C. D. Cantrell (03/1999) THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

UNIX mount COMMAND

• Used to mount a filesystem:
mount [ -F FSType ] [ generic_options ] [ -o specific_options ] [ -O ] special | mount_point
• Used to display information about mounted filesystems:
thor% mount
/ on /dev/dsk/c0t0d0s0 read/write/setuid/largefiles on Sun Mar 14 08:35:32 1999
/proc on /proc read/write/setuid on Sun Mar 14 08:35:32 1999
/dev/fd on fd read/write/setuid on Sun Mar 14 08:35:32 1999
/tmp on swap read/write on Sun Mar 14 08:35:34 1999
/thor on /dev/dsk/c0t0d0s1 setuid/read/write/largefiles on Sun Mar 14 08:35:34 1999
/thor/optics on wotan:/optics read/write/remote on Sun Mar 14 08:35:54 1999
/warez/this on f220as02:/warez/this noquota/remote on Sun Mar 14 08:35:55 1999
/cdrom/hplogic on /vol/dev/dsk/c0t2d0/hplogic read only/nosuid on Sun Mar 14 08:36:12 1999
/home/thor on /thor read/write on Sun Mar 14 16:35:06 1999
• To unmount a filesystem, one uses umount

c C. D. Cantrell (03/1999) THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

4.4BSD DISK FILESYSTEM ORGANIZATION (1)

• Outlined in /usr/src/sys/ufs/lfs/README
• A disk is laid out in segments
 The first 8 kB of the disk is used for boot information
 The first segment starts 8 kB into the disk
 Each segment is composed of the following:
◦ An optional super block
◦ One or more groups of: segment summary, 0 or more data blocks, 0 or more inode blocks
◦ A summary block describes the data/inode blocks that follow
◦ First segment at creation time:
| 8K pad | super block | summary block | inode block | ifile | root dir | l+f dir |
◦ Subsequent layout:
| summary block | data/inode blocks | summary block | data/inode blocks | ... |
c C. D. Cantrell (03/1999)

Summary block (detail):
| sum cksum | data cksum | next segment | timestamp | FINFO count | inode count | flags |
| FINFO-1 ... FINFO-N — 0 or more file info structures, identifying the blocks in the segment |
| inode-N ... inode-1 — 0 or more inode daddr_t's, identifying the inode blocks in the segment |

4.4BSD DISK FILESYSTEM ORGANIZATION (2)

• A superblock maintains:
 A list of all superblocks
 Enough information to find the inode for the ifile
• The ifile
 Visible in the file system as inode number IFILE INUM
 Contains information shared between the kernel and user processes:
| cleaner info — cleaner information per file system (page granularity) |
| segment usage table — space available and last-modified times per segment (page granularity) |
| IFILE-1 ... IFILE-N — per-inode status information: current version #, whether currently allocated, last access time, and current disk address of the block containing the inode; if the current disk address is LFS_UNUSED_DADDR, the inode is not in use and is on the free list |
c C. D. Cantrell (03/1999)

UNIX ufs FILESYSTEM CREATION

example# newfs -Nv /dev/rdsk/c0t0d0s6
mkfs -F ufs -o N /dev/rdsk/c0t0d0s6 1112940 54 15 8192 1024 16 10 60 2048t0-18
/dev/rdsk/c0t0d0s6: 1112940 sectors in 1374 cylinders of 15 tracks, 54 sectors
569.8MB in 86 cyl groups (16 c/g, 6.64MB/g, 3072 i/g)
super-block backups (for fsck -b #) at:
32, 13056, 26080, 39104, 52128, 65152, 78176, 91200, 104224, ...

c C. D. Cantrell (03/1999) THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

SOLARIS STAT STRUCTURE (1)

• Members of the Solaris 2.6 stat structure include the following:
mode_t st_mode;       /* File mode (see mknod(2)) */
ino_t st_ino;         /* Inode number */
dev_t st_dev;         /* ID of device containing */
                      /* a directory entry for this file */
dev_t st_rdev;        /* ID of device */
                      /* This entry is defined only for */
                      /* char special or block special files */
nlink_t st_nlink;     /* Number of links */
uid_t st_uid;         /* User ID of the file's owner */
gid_t st_gid;         /* Group ID of the file's group */
off_t st_size;        /* File size in bytes */
time_t st_atime;      /* Time of last access */
time_t st_mtime;      /* Time of last data modification */
time_t st_ctime;      /* Time of last file status change */
                      /* Times measured in seconds since */
                      /* 00:00:00 UTC, Jan. 1, 1970 */
long st_blksize;      /* Preferred I/O block size */
blkcnt_t st_blocks;   /* Number of 512 byte blocks allocated */

c C. D. Cantrell (03/1999) THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

SOLARIS STAT STRUCTURE (2)

• Descriptions of structure members from man page for stat (2) st_mode The mode of the file as described in mknod(2). In addition to the modes described in mknod(), the mode of a file may also be S_IFLNK if the file is a symbolic link. S_IFLNK may only be returned by lstat().

st_ino This field uniquely identifies the file in a given file system. The pair st_ino and st_dev uniquely identifies regular files.

st_dev This field uniquely identifies the file system that contains the file. Its value may be used as input to the ustat() function to determine more information about this file system. No other meaning is associated with this value.

st_rdev This field should be used only by administrative commands. It is valid only for block special or character special files and only has meaning on the system where the file was configured.

c C. D. Cantrell (03/1999) THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

SOLARIS STAT STRUCTURE (3)

• Descriptions (concluded) st_nlink Number of links (administrative use only). st_uid The user ID of the file’s owner. st_gid The group ID of the file’s group. st_size For regular files, this is the address of the end of the file; undefined for block or character special. st_atime Time when file data was last accessed. Changed by the following functions: creat(), mknod(), pipe(), utime(2), and read(2). st_mtime Time when data was last modified. Changed by the following functions: creat(), mknod(), pipe(), utime(), and write(2). st_ctime Time when file status was last changed. Changed by the following functions: chmod(), chown(), creat(), link(2), mknod(), pipe(), unlink(2), utime(), and write(). st_blksize Hint as to the "best" unit size for I/O operations; undefined for block or character special files. st_blocks The total number of physical blocks of size 512 bytes actually allocated on disk; undefined for block or character special files.

c C. D. Cantrell (03/1999) THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

SOLARIS MODE INTEGER

• Contains the file access permissions (digits are octal):
S_ISUID 04000 Set user ID on execution.
S_ISGID 020#0 Set group ID on execution if # is 7, 5, 3, or 1. Enable mandatory file/record locking if # is 6, 4, 2, or 0.
S_ISVTX 01000 Save text image after execution. On directories, restricted deletion flag.
S_IRWXU 00700 Read, write, execute by owner.
S_IRUSR 00400 Read by owner.
S_IWUSR 00200 Write by owner.
S_IXUSR 00100 Execute (search if a directory) by owner.
S_IRWXG 00070 Read, write, execute by group.
S_IRGRP 00040 Read by group.
S_IWGRP 00020 Write by group.
S_IXGRP 00010 Execute (search if a directory) by group.
S_IRWXO 00007 Read, write, execute by others.
S_IROTH 00004 Read by others.
S_IWOTH 00002 Write by others.
S_IXOTH 00001 Execute (search if a directory) by others.

c C. D. Cantrell (03/1999) THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

FILE DESCRIPTORS

• File descriptor: a small positive integer
int fd;
char buf[256];
fd = open("file.txt", O_RDWR | O_CREAT, S_IRUSR);
write(fd, buf, sizeof(buf));
close(fd);
• Sending and receiving work similarly for sockets
 Setup is different (no open or seek)
◦ Some sockets must passively await connections
◦ The destination address may be specified in each datagram, not bound at socket creation time
 Permissions are handled differently
• On a Solaris machine, the descriptors used by the process with ID number <pid> are listed in the directory /proc/<pid>/fd
c C. D. Cantrell (10/1999)

[Figure: process, file and socket structures — proc{}.p_fd points to filedesc{}, whose fd_ofiles array is indexed by fd and points to file{} structures; file{}.f_ops points to socketops (soo_read, soo_write, soo_ioctl, soo_select, soo_close) and file{}.f_data points to a socket{} (so_type, so_proto → protosw{})]

BUSES

• Bus: a communication link shared by multiple subsystems
 Physically: parallel conductors (traces on a die or PC board; cable)
 Advantages:
◦ Low cost (compared to point-to-point wiring)
◦ Versatility of interconnections
 Disadvantages:
◦ Electrical problems ⇒ short length
Bus skew
Dispersion
Crosstalk
◦ Shared resource ⇒ contention
 Organization:
◦ Control lines to signal & acknowledge requests
◦ Data lines to carry addresses, data or commands
c C. D. Cantrell (02/1999)

PROCESSOR–I/O INTERFACE BUS TYPES

Backplane bus . Processor, memory and I/O devices coexist on the same bus . In olden times, often built into the backplane of a computer An interconnection structure that was part of the chassis . Processor architecture includes explicit I/O instructions (IN, OUT) . Standard backplane buses: VMEbus, Multibus, NuBus, PCI, ISA (Industry Standard Architecture) bus I/O bus . Examples: IDE, SCSI

c C. D. Cantrell (05/1999)

[Figure: three bus organizations — (a) processor, memory and I/O devices on the same backplane bus; (b) processor and memory on a backplane bus, with bus adapters providing interfaces for various I/O buses; (c) processor and memory on a fast synchronous processor-memory bus, with one bus adapter interfacing it to the backplane bus and further adapters from the backplane bus to I/O buses]

[Figure: I/O system using only a backplane bus — processor (with cache), main memory and I/O controllers (network, disk, disk, graphics output) all share one memory–I/O bus; devices signal the processor via interrupts]

[Figure: I/O system using an I/O bus — CPU, cache and main memory on a CPU-memory bus; a bus adapter connects to a separate I/O bus carrying the I/O controllers for disks, graphics output and the network]

[Figure: Macintosh 72xx I/O system — processor and main memory on a PCI backplane bus; I/O controllers and bus adapters attach stereo input/output, serial ports, the Apple desktop bus, a SCSI bus (disk, CD-ROM, tape), graphics output and Ethernet]

[Figure: Pentium II I/O system (Tanenbaum, Structured Computer Organization) — cache bus (level 2 cache), local bus (CPU, PCI bridge) and memory bus (main memory); the PCI backplane bus carries SCSI, USB, an ISA bridge, an IDE disk, a graphics adaptor and an available PCI slot (monitor, mouse, keyboard attach here); the ISA bus carries a sound card, a modem, a printer and an available ISA slot]

[Figure: Enterprise 10000 hardware architecture — the Gigaplane-XB includes the XB-interconnect, 4 address buses and bulk power distribution, plus local power converters]
• Data is packet-switched using a crossbar
• Addresses are broadcast

3-STATE BUFFER

• A 3-state buffer has 2 inputs and 1 output
 Enable asserted: output = input (state is either 0 or 1)
 Enable deasserted: high-impedance state (denoted Hi-Z or Z)
◦ The output can then be driven by another device
 Equivalent to a mechanical switch
c C. D. Cantrell (02/1999)

c C. D. Cantrell (02/1999) THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

EXCITATION TABLE FOR 3-STATE BUFFER

• A tristate buffer has 3 possible output values:
 Asserted
 Deasserted
 High impedance (floating)

enable in out
0      0  Z
0      1  Z
1      0  0
1      1  1
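The excitation table maps directly onto a three-valued function; a C sketch (all names ours), plus a toy bus resolver showing why only one driver may be enabled at a time:

```c
/* Three-valued logic for a tristate buffer output: 0, 1, or Z. */
typedef enum { LOW = 0, HIGH = 1, HI_Z = 2 } tri_t;

/* The output follows the input when enable is asserted; otherwise
 * the output floats (high impedance), as in the excitation table. */
tri_t tristate(int enable, int in) {
    return enable ? (in ? HIGH : LOW) : HI_Z;
}

/* A one-bit bus line driven by n tristate outputs: floating drivers
 * are ignored; if more than one driver is active, the value on a
 * real bus would be undefined (contention), flagged here as HI_Z. */
tri_t bus_resolve(const tri_t *drivers, int n) {
    tri_t v = HI_Z;
    for (int i = 0; i < n; i++) {
        if (drivers[i] == HI_Z) continue;
        if (v != HI_Z) return HI_Z;   /* two active drivers: contention */
        v = drivers[i];
    }
    return v;
}
```

Disabling every buffer but one is exactly what a bus master does when it grants a device access to a shared line.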

c C. D. Cantrell (02/1999) THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

USE OF TRISTATES TO ENABLE/DISABLE BUS ACCESS

c C. D. Cantrell (02/1999) THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

BUS DESIGN CONSTRAINTS

• Laws of physics limit bus speeds
 Transmission speed ≤ speed of light
 Crosstalk
◦ Occurs because:
A time-varying voltage on a conductor induces a charge q2 = C12 v1 on another, parallel conductor
A time-varying current in a conductor induces a voltage v2 = L12 di1/dt in another, parallel conductor
◦ Limits bus clock frequency
◦ Can be reduced by:
Grounding alternate conductors
Abandoning the bus concept and using twisted-pair, point-to-point connections (Seymour Cray)
 EMI & reflections limit the number of devices connected to a bus
 Real estate on a die or PC board limits the number of lines
c C. D. Cantrell (02/1999)

COMPLEX ULTRA-SCSI CHAIN

[Figure: complex Ultra-SCSI chain — 3 meters (10 feet) overall length; individual segment lengths in centimeters: 7.62, 30.48, 30.48, 210.82, then five 10.16-cm segments; terminators at both ends; three 12.45-cm stubs and five 7.37-cm stubs, 25 pF each; driver at device position 7, devices at positions 7–0]

[Figure: ACK signals on the complex Ultra-SCSI chain — oscilloscope traces (2 volts per division, 10 nanoseconds per division) of the logic signal driving the SCSI driver, the driver output at position 7, and the driver inputs seen at positions 6, 5, 4 and 0]

[Figure: ACK signals on a point-to-point Ultra-SCSI bus — 25 meters (82 feet) of shielded 34-pair external cable, terminators at both ends, driver and receiver end loads only; traces (2 volts per division, 10 nanoseconds per division) of the driver input, the driver output, and the receiver input after 25 m]

ASYNCHRONOUS vs. SYNCHRONOUS BUSES

• Bus communication protocol: specification of the sequence of events and timing requirements for transferring information on a bus
• Asynchronous bus transfers:
 Certain conductors on the bus are control lines
 Signals on the control lines control the sequence of events
• Synchronous bus transfers:
 Events are sequenced relative to a master clock signal
 Once a certain kind of transfer has been initiated, no further command signaling is necessary to control the transfer

c C. D. Cantrell (05/1999) THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

SYNCHRONOUS BUSES

• Bus clock is phase-locked to the processor clock
 Bus clock frequency = (1/n) × processor clock frequency (n = 1 to 6)
 The clock signal is carried on a control line
 The communications protocol is defined with reference to the bus clock signal
 Local bus (e.g., VESA Local Bus):
◦ Extends the processor's bus control signals
◦ May connect the processor to the L2 cache
◦ May connect the processor and memory to high-speed I/O devices
• Advantages:
 Fast & wide
 Simple logic (finite state machine)
• Disadvantages:
 Must be short (bus skew; attenuation; crosstalk)
 All devices must run at the same frequency
c C. D. Cantrell (02/1999)

80286 – PENTIUM I/O

Separate I/O and memory address spaces . Since the 8086, I/O or memory access is signaled by M/IO# (memory access if high, I/O if low) For MOVE (memory–CPU copy), M/IO# is high For IN or OUT (I/O), M/IO# is low M/IO# is a processor signal that does not appear on the ISA bus Instead, M/IO# is an input to the bus controller . I/O address space is 0x0000 to 0xffff

c C. D. Cantrell (05/1999) THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

80286 SIGNALS

[Figure: 80286 signals — pin diagram showing CLK; data lines D0–D15 via upper and lower data bus transceivers; address lines A0–A23 via an address latch; READY, HOLD, INTR, NMI, PEREQ, BUSY, ERROR, BHE, M/IO#, COD/INTA#, LOCK, RESET, HLDA, PEACK, S0, S1 and CAP]

c C. D. Cantrell (05/1999)

ISA BUS

• ISA = Industry Standard Architecture
 Synchronous
 Industry response to IBM's Micro Channel architecture
 Uses both the PC/AT and the IBM PC bus standards
◦ Interface cards have 2 sets of connectors
◦ PC bus: 8 data lines, 20 address lines
◦ ISA bus: 16 data lines, 24 address lines; bus frequency 8.33 MHz
◦ Maximum possible throughput: 2 bytes × 8.33 MHz = 16.67 MB/s
 Separate I/O and memory address spaces
◦ Since the 8086, I/O or memory access is signaled by M/IO# (memory access if high, I/O if low)
◦ For MOVE (memory–CPU copy), M/IO# is high
◦ For IN or OUT (I/O), M/IO# is low
◦ I/O address space is 0x0000 to 0xffff
c C. D. Cantrell (05/1999)

ISA BUS CONNECTORS

[Figure: ISA bus connectors — a motherboard (CPU and other chips) with PC bus connectors; a plug-in contact board carries a PC-bus edge connector plus the new connector added for the PC/AT — Tanenbaum, Structured Computer Organization]

PCI BUS

• PCI = Peripheral Component Interconnect
 Synchronous
 PCI 1.0: clock frequency 33 MHz, 32-bit-wide data path
 PCI 2.1: clock frequency 66 MHz, 64-bit-wide data path
◦ Maximum theoretical bandwidth: 8 bytes × 66 MHz = 528 MB/s
 Transactions are negative-edge-triggered
 Address and data lines are multiplexed
 The bus arbiter is usually built into the chipset
 Every PCI device has a 256-byte configuration address space that is readable by other devices ⇒ Plug 'n Play
• PCI cards
 Options include voltage (5 V vs. 3.3 V), width (32 bits/120 pins vs. 64 bits/184 pins) and frequency (33 vs. 66 MHz)
c C. D. Cantrell (05/1999)

PCI BUS ARBITER

[Figure: PCI bus arbiter — a central arbiter exchanges a REQ#/GNT# line pair with each of four PCI devices — Tanenbaum, Structured Computer Organization]

PCI BUS TIMING FOR READ AND WRITE CYCLES

[Figure: PCI bus timing for read and write cycles — clock periods T1–T7 cover a read cycle, an idle cycle and a write cycle; the traces show the clock Φ, the multiplexed AD lines (address, turnaround, data), C/BE# (read command / byte enables / write command / enables), FRAME#, IRDY#, DEVSEL# and TRDY# — Tanenbaum, Structured Computer Organization]

PCI BUS SIGNALS

MANDATORY PCI BUS SIGNALS

Signal Lines Master Slave Description
CLK 1 — — Clock (33 MHz or 66 MHz)
AD 32 × × Multiplexed address and data lines
PAR 1 × — Address or data parity bit
C/BE# 4 × — Bus command/bit map for bytes enabled
FRAME# 1 × — Indicates that AD and C/BE are asserted
IRDY# 1 × — Read: master will accept; write: data present
IDSEL 1 × — Select configuration space instead of memory
DEVSEL# 1 — × Slave has decoded its address and is listening
TRDY# 1 — × Read: data present; write: slave will accept
STOP# 1 — × Slave wants to stop transaction immediately
PERR# 1 — — Data parity error detected by receiver
SERR# 1 — — Address parity error or system error detected
REQ# 1 — — Bus arbitration: request for bus ownership
GNT# 1 — — Bus arbitration: grant of bus ownership
RST# 1 — — Reset the system and all devices

OPTIONAL PCI BUS SIGNALS

Signal Lines Master Slave Description
REQ64# 1 × — Request to run a 64-bit transaction
ACK64# 1 — × Permission is granted for a 64-bit transaction
AD 32 × — Additional 32 bits of address or data
PAR64 1 × — Parity for the extra 32 address/data bits
C/BE# 4 × — Additional 4 bits for byte enables
LOCK 1 × — Lock the bus to allow multiple transactions
SBO# 1 — — Hit on a remote cache (for a multiprocessor)
SDONE 1 — — Snooping done (for a multiprocessor)
INTx 4 — — Request an interrupt
JTAG 5 — — IEEE 1149.1 JTAG test signals
M66EN 1 — — Wired to power or ground (66 MHz or 33 MHz)

Tanenbaum, Structured Computer Organization THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

OBTAINING BUS ACCESS

• Goal: give every device fair access
• Method: use bus masters
 A master enables bus access for one or more devices (by enabling/disabling tristate buffers)
 A single bus master can be a bottleneck
 Multiple masters require arbitration
◦ Every device has a priority (IRQ number, SCSI ID, ...)
◦ Extra control lines are needed for bus request/access
 Arbitration methods:
◦ Centralized & parallel (SCSI)
◦ Daisy chain (VMEbus)
◦ Distributed arbitration using self-selection (NuBus)
◦ Distributed arbitration using collision detection (Ethernet)

c C. D. Cantrell (02/1999) A BUS TRANSACTION WITH A SINGLE MASTER

[Figure: a bus transaction with a single master — (a) a device generates a bus request on the bus request lines; (b) the master (processor) responds by generating control signals (for read, etc.); (c) the processor notifies the I/O device that its request is being processed, and the device then puts the address for the request on the bus]

DAISY CHAIN

[Figure: daisy chain — devices 1 (highest priority) through n (lowest priority) are chained on the bus grant line; request and release lines run to the bus arbiter]

A daisy chain bus uses a bus grant line that chains through each device from highest to lowest priority. The protocol is:
1. Signal on the request line
2. Wait for a low-to-high transition on the grant line (indicates reassignment)
3. Intercept the grant signal and stop asserting the request line
4. Use the bus
5. Signal that the bus is no longer required by asserting the release line

ASYNCHRONOUS BUSES

• Not clocked
 Can accommodate many kinds of devices (disk, tape, scanner, ...)
• Data transfer is controlled with a handshaking protocol on dedicated control lines; represent it with a finite state machine for each device
• Example (SCSI-1 bus):
 The bus controller asserts Sel (select device) and transmits the device ID
 The selected device responds with Ack
 The controller asserts the Cmd (command), Msg (message), and Req (request a data transfer) signals, then transmits command bytes
 The device responds to each byte with Ack
 The controller deasserts Cmd, asserts I/O, then transmits data bytes
 The device responds to each byte with Ack

c C. D. Cantrell (02/1999) STEPS OF AN ASYNCHRONOUS OUTPUT OPERATION

[Figure: steps of an asynchronous output operation — (a) initiation of a read operation from memory: the control lines carry the read command and the data lines carry the address; (b) memory access; (c) memory puts the data on the data lines of the bus and uses the control lines to signal the I/O device that the data is available]

[Figure: steps of an asynchronous input operation — (a) the control lines carry a write request to memory and the data lines carry the address; (b) memory signals the device that it is ready, and the data is transferred]

ASYNCHRONOUS BUS HANDSHAKING PROTOCOL

[Figure: handshake timing — waveforms for ReadReq, Data, Ack and DataRdy, annotated with steps 1–7 of the protocol]

1. When memory sees ReadReq asserted, it reads the address from the data bus and asserts Ack 2. I/O device sees Ack asserted, releases ReadReq and data lines 3. Memory sees ReadReq deasserted, drops Ack to acknowledge ReadReq 4. Memory puts requested data on the data lines, asserts DataRdy 5. I/O device sees DataRdy, reads data, signals that it has seen the data by asserting Ack 6. Memory sees Ack, drops DataRdy, releases data lines 7. I/O device sees DataRdy deasserted, drops Ack to signal end of transmission
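The seven steps can be run as two tiny finite state machines exchanging signals through a shared "bus" structure (a simulation sketch; all names are ours):

```c
/* Signal lines of the read handshake, modeled as a shared struct. */
struct bus { int ReadReq, DataRdy, Ack; unsigned data; };

enum { MEM_IDLE, MEM_ACKED, MEM_DATA } mstate;
enum { DEV_REQ, DEV_WAIT_DATA, DEV_GOT } dstate;

unsigned mem_word = 0xCAFE;   /* the word memory will return */
unsigned dev_word;            /* where the device stores the result */

void memory_step(struct bus *b) {
    switch (mstate) {
    case MEM_IDLE:      /* step 1: see ReadReq, latch address, assert Ack */
        if (b->ReadReq) { b->Ack = 1; mstate = MEM_ACKED; }
        break;
    case MEM_ACKED:     /* steps 3-4: ReadReq dropped -> drop Ack, drive data */
        if (!b->ReadReq) {
            b->Ack = 0; b->data = mem_word; b->DataRdy = 1; mstate = MEM_DATA;
        }
        break;
    case MEM_DATA:      /* step 6: device Acked the data -> release lines */
        if (b->Ack) { b->DataRdy = 0; mstate = MEM_IDLE; }
        break;
    }
}

void device_step(struct bus *b) {
    switch (dstate) {
    case DEV_REQ:       /* step 2: Ack seen -> release ReadReq and data lines */
        if (b->Ack) { b->ReadReq = 0; dstate = DEV_WAIT_DATA; }
        break;
    case DEV_WAIT_DATA: /* step 5: DataRdy seen -> read data, assert Ack */
        if (b->DataRdy) { dev_word = b->data; b->Ack = 1; dstate = DEV_GOT; }
        break;
    case DEV_GOT:       /* step 7: DataRdy dropped -> drop Ack, ready again */
        if (!b->DataRdy) { b->Ack = 0; dstate = DEV_REQ; }
        break;
    }
}

/* Run one complete read transaction; returns the word the device got. */
unsigned run_transfer(void) {
    struct bus b = {0};
    mstate = MEM_IDLE;
    dstate = DEV_REQ;
    b.ReadReq = 1;      /* device puts the address on the bus, asserts ReadReq */
    for (int i = 0; i < 10; i++) { memory_step(&b); device_step(&b); }
    return dev_word;
}
```

Because each machine advances only when it sees the other's signal change, neither side needs a clock — which is the point of the asynchronous protocol.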

[Figure: finite state machines for the I/O device and memory during the handshake — the device puts the address on the data lines and asserts ReadReq; memory records the address from the data lines and asserts Ack; the device releases the data lines and deasserts ReadReq; memory drops Ack, puts the memory data on the data lines and asserts DataRdy; the device reads the data from the data lines and asserts Ack; memory releases the data lines and deasserts DataRdy; the device deasserts Ack, and a new I/O request can begin]

6 7 Release data Deassert Ack lines and DataRdy New I/O request THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

SCSI-1: AN ASYNCHRONOUS BUS (1)

• SCSI := Small Computer System Interface
 Many “standard” implementations
 Can connect many different kinds of devices:
◦ Logic board
◦ Hard drive
◦ CD-ROM drive
◦ Tape drive
◦ Scanner
 Controller chip on the logic board or on a plug-in card
 The controller is connected by cable to internal or peripheral devices
 Devices are daisy-chained
 Device ID is set by hardware switches
c C. D. Cantrell (02/1999)

c C. D. Cantrell (02/1999) THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

SCSI-1: AN ASYNCHRONOUS BUS (2)

• SCSI-1 bus configuration
 Peripheral SCSI-1 devices are connected by cable
 Each bit of a data byte is transferred on a separate wire (line) of the cable
 Each device must have a unique ID number between 0 and 7
◦ The ID is signaled by asserting one of the lines DB(0) – DB(7)
◦ In case of contention, the device with the highest ID wins
◦ The logic board has ID 7, so it always wins
c C. D. Cantrell (02/1999)
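"Highest ID wins" is just a scan from DB(7) downward over the asserted data lines; a sketch (the function name is ours):

```c
/* SCSI-1 arbitration: each contender asserts its own DB line during
 * the arbitration phase, and the device with the highest ID wins.
 * `asserted` is the bit map of DB(0)-DB(7); returns the winning ID,
 * or -1 if no device is contending. */
int scsi_arbitrate(unsigned char asserted) {
    for (int id = 7; id >= 0; id--)       /* scan from highest priority */
        if (asserted & (1u << id))
            return id;
    return -1;
}
```

This is a form of distributed self-selection: every device can read all eight lines, so each loser can see for itself that a higher ID is asserted and back off.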

c C. D. Cantrell (02/1999) THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

SCSI ID BITS

http://scitexdv.com/SCSI2/ THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

SCSI-1: AN ASYNCHRONOUS BUS (3)

SCSI signaling sequence for data transfer . Controller broadcasts SEL (select) signal on pin 44 and the ID number on one of the data lines . Device selected responds with ACK (acknowledge) signal on pin 48 (handshake) . Controller sends REQ (request) signal on pin 48 to order device to perform a task (such as transferring a data byte) . Command bytes are transferred on the data bus . A handshake must take place for each data byte transferred

c C. D. Cantrell (02/1999)

SCSI Bus Signals (Signal / Driven By / Explanation)
DB0–DB7 Initiator/Target 8-bit bidirectional data bus.
DBP Initiator/Target Data-bus parity line. Optional.
ATN Initiator Attention. Used to send a message to the target when it controls the bus.
BSY Initiator/Target Busy. Indicates that the bus is unavailable for use.
ACK Initiator Acknowledge. Used by the initiator for handshaking.
RST Any Device Reset. Used to initiate a bus-free phase.
MSG Target Driven by the target to indicate that the current transfer is a message.
SEL Initiator Select. Used by the initiator to select a target before command execution. Also used by the target to reconnect when the reselection phase is implemented.
C/D Target Control/Data. Used during the information transfer phases to transfer commands, status, data or messages over the bus.
REQ Target Request. Used by the target during information transfer phases.
I/O Target Input/Output. Determines the direction of the transfer.

[Figure: phase sequences of the SCSI bus — BUS FREE → ARBITRATION (optional) → SELECTION or RESELECTION (optional) → information transfer phases (COMMAND, DATA, STATUS, MESSAGE)]

SCSI Information Transfer Phases
SEL BSY MSG C/D I/O Direction Phase
0 1 0 0 0 To Target Data Out
0 1 0 0 1 From Target Data In
0 1 0 1 0 To Target Command
0 1 0 1 1 From Target Status
0 1 1 0 0 — Reserved
0 1 1 0 1 — Reserved
0 1 1 1 0 To Target Message Out
0 1 1 1 1 From Target Message In

SCSI BUS TOPOLOGY

http://homebrew.cs.ubc.ca/415/project-submissions/group9/notes/scsi-2.html THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science

I/O AS SYNCHRONIZATION OF DATA TRANSFERS

• Fundamental problems of communication between devices, or between the CPU and peripheral devices:
 Detection that a data transfer is necessary
◦ Dedicated polling
◦ Interrupts
◦ Periodic polling
 Synchronization of two devices, or of a device and a CPU, with different speeds
◦ Wait state insertion
◦ DMA
◦ Dual-ported memory
◦ FIFO buffers
◦ Caches

c C. D. Cantrell (05/1999)

POLLED I/O

[Flowchart: polling loop — wait; is data ready? if no, keep waiting; if yes, read data; done? if no, loop back]

A polling loop is not an efficient way to use a CPU unless the device is very fast. If the device is fast, then "data ready" checks can be interspersed among useful instructions. In most cases it is more efficient for the I/O device to tell the CPU when data is ready, or when a transfer is complete, than for the CPU to check the device frequently. An I/O device can use interrupts to tell the CPU that a data transfer should be started, or is finished.

DEDICATED vs. PERIODIC POLLING

Periodic polling means that the CPU periodically interrogates the I/O device (e.g., via an oscillator–counter–decoder combination) to see whether data is ready
Dedicated polling (spin waiting) means that the I/O device controller sets or clears bits in a status register that is read in a tight loop by the CPU
. When a system call for keyboard input is issued, and dedicated polling is in use, the CPU executes code somewhat like this:

get_loop: lw   $a0, Device_Status   # read the device status register
          bgez $a0, get_loop        # spin while the ready bit (bit 31) is clear
          lb   $2, Device_Data      # read the data byte
          rfe                       # return from exception

. This operation transfers only a single byte; data may be missed
. A different approach is necessary for block transfers


INTERRUPT-DRIVEN I/O (1)

An interrupt is an event that occurs outside the execution cycle and that causes processing of the current thread to stop
. Interrupts can be used to give I/O devices a means to signal the CPU that an event has occurred that requires action by the CPU (data is ready, etc.)
. An interrupt causes an exception, which results in a jump to the appropriate exception-handling code (MIPS: address 0x80000080)
. There are (at least) two principal methods for detecting interrupts in hardware:
  - Connect the interrupt request output of an I/O device to one of the inputs of an interrupt controller
    Interrupts may be level-triggered or edge-triggered
  - Connect one interrupt line to an OR of inputs from several devices that are periodically strobed for data ready
    The device that caused the interrupt can be detected by reading a status word formed from inputs from the devices

INTERRUPT-DRIVEN I/O (2)

On a RISC machine, an interrupt causes a jump to the general exception-handling code (with a few special cases such as Reset and UTLB Miss)
. Method of P&H Chapter 5: Execution is suspended immediately
  - This method is required for some exceptions (TLB miss, page fault) unless execution can be undone
  - Restarting is hard in ISAs where memory is accessed multiple times during execution of an instruction
. Method of choice: The instruction that caused the exception is allowed to finish; subsequent instructions are suspended
. Pending interrupts must be handled before the next instruction is fetched
. The exception handler determines the code to execute, based on the Cause register contents
. The operating system determines what state needs to be saved (if any) besides the EPC and Cause registers

INTERRUPT-DRIVEN I/O (3)

MIPS R2000 interrupt handler:
. Saves $a0 and $v0 in special locations
  - save0 is at address 0x90000250; save1 is at address 0x90000254
  - $a0 and $v0 can’t be pushed onto the stack, because the cause of the exception may be a bad stack pointer!
. Copies coprocessor 0 Cause and EPC registers into $k0 and $k1
. Pushes the current Kernel/User mode and Interrupt Enable bits onto the stack in the Status register (see next slide)
. The kernel’s exception handler uses a jump table (or a sequence of beq’s) to determine the right code to execute (see SPIM kernel text)
. The operating system clears the interrupts, if any
. After executing an rfe instruction, the processor may restart execution at the address in the EPC

MIPS R2000 STATUS REGISTER

[Figure: MIPS R2000 Status register layout — bits 31–28: CU (coprocessor usability); BEV, TS, PE, CM, PZ, SwC, IsC; bits 15–8: interrupt mask; bits 5–0: three {Kernel/User, Interrupt Enable} bit pairs (Old, Previous, Current)]

The stack of kernel/user and interrupt-enable bits lets the processor respond to two levels of exceptions before software must save the Status register

MIPS R2000 CAUSE REGISTER

[Figure: MIPS R2000 Cause register layout — bits 15–10: pending interrupts; bits 5–2: exception code (ExcCode)]

EXCEPTION CODES IN THE MIPS R2000 ISA

ExcCode  Name  Description
   0     Int   External interrupt
   1     MOD   TLB modification exception
   2     TLBL  TLB miss exception (load or instruction fetch)
   3     TLBS  TLB miss exception (store)
   4     AdEL  Address error exception (load or instruction fetch)
   5     AdES  Address error exception (store)
   6     IBE   Instruction fetch bus error exception
   7     DBE   Data load or store bus error exception
   8     Sys   System call exception
   9     Bp    Breakpoint exception
  10     RI    Reserved or undefined instruction exception
  11     CpU   Coprocessor unusable exception
  12     Ovf   Arithmetic overflow exception

# SPIM TRAP HANDLER DATA
        .kdata
__m1_:  .asciiz " Exception "
__m2_:  .asciiz " caught by trap handler.\n"
__m3_:  .asciiz "Continuing. . .\n"
__m4_:  .asciiz "Halting.\n"
__e0_:  .asciiz " [Interrupt]"
__e1_:  .asciiz " [TLB modification !BUG!]"
__e2_:  .asciiz " [TLB miss !BUG!]"
__e3_:  .asciiz " [TLB miss !BUG!]"
__e4_:  .asciiz " [Unaligned address in inst/data fetch]"
__e5_:  .asciiz " [Unaligned address in store]"
__e6_:  .asciiz " [Bad address in text read]"
__e7_:  .asciiz " [Bad address in data/stack read]"
__e8_:  .asciiz " [Error in syscall]"
__e9_:  .asciiz " [Breakpoint]"
__e10_: .asciiz " [Reserved instruction]"
__e11_: .asciiz " [Syscall exception !BUG!]"
__e12_: .asciiz " [Arithmetic overflow]"
__e13_: .asciiz " [Inexact floating point result]"
__e14_: .asciiz " [Invalid floating point result]"
__e15_: .asciiz " [Divide by 0]"
__e16_: .asciiz " [Floating point overflow]"
__e17_: .asciiz " [Floating point underflow]"
__excp: .word __e0_,__e1_,__e2_,__e3_,__e4_,__e5_,__e6_,__e7_,__e8_,__e9_
        .word __e10_,__e11_,__e12_,__e13_,__e14_,__e15_,__e16_,__e17_
s1:     .word 0
s2:     .word 0

# SPIM TRAP HANDLER CODE
        .ktext
        .space 0x80            # Put trap handler at 0x80000080
        sw $v0 s1              # Not re-entrant
        sw $a0 s2              # Don't need to save $k0/$k1
        mfc0 $k0 $13           # Cause
        and $k0 $k0 0xff       # Use just ExcCode field
        mfc0 $k1 $14           # EPC
        li $v0 4               # Print " Exception "
        la $a0 __m1_
        syscall
        li $v0 1               # Print exception number
        srl $a0 $k0 2
        syscall
        li $v0 4               # Print type of exception
        lw $a0 __excp($k0)
        syscall
        li $v0 4               # Print " caught by trap handler.\n"
        la $a0 __m2_
        syscall
        srl $a0 $k0 2
        beq $a0 12 ret         # continue on overflow
        beq $a0 13 ret         # continue on inexact fp result
        beq $a0 14 ret         # continue on invalid fp result
        beq $a0 16 ret         # continue on fp overflow
        beq $a0 17 ret         # continue on fp underflow
        li $v0 4               # Print "Halting.\n"
        la $a0 __m4_
        syscall
        li $v0 10              # Exit on all other exceptions
        syscall                # syscall 10 (exit)
ret:    li $v0 4               # Print "Continuing. . .\n"
        la $a0 __m3_
        syscall

        lw $v0 s1
        lw $a0 s2
        addiu $k1 $k1 4        # Return to next instruction
        rfe                    # Return from exception handler
        jr $k1

        .text
        .globl __start

SINGLE- AND MULTIPLE-LINE INTERRUPT SYSTEMS

[Figure: single-line interrupt system — a single interrupt flip-flop feeds one interrupt line to the CPU]

[Figure: multiple-line interrupt system — an interrupt register latches numbered interrupt request lines (0, 1, 2, 3), each individually visible to the CPU]


VECTORED INTERRUPT SYSTEM

[Figure: vectored interrupt system — interrupt request lines are latched in an interrupt register and gated by an interrupt mask register (an input that is active and unmasked is "interrupt pending"); a priority encoder converts the pending requests into an interrupt number sent to the CPU]


INTERRUPT-DRIVEN I/O (4)

In the Motorola 68000 series, the CPU checks for pending interrupts after execution of each instruction
. The CPU saves the status register (SR) and enters supervisor mode
. After determining the interrupt number N, the CPU saves state information and executes M[4N] → PC, causing a branch to the code at the location pointed to by M[4N]


VECTORED INTERRUPTS IN THE IBM PC

[Figure: cascaded 8259A interrupt controllers in the IBM PC — a master 8259A accepts IRQ0–IRQ7 and drives INT/INTA to the CPU; a second 8259A accepts IRQ8–IRQ15 and cascades into the master; both share the RD, WR, A0, CS, and D0–D7 bus connections]

MEMORY-MAPPED I/O

Instead of having multiple address spaces for memory, I/O, etc., have a single address space
. Loading from a memory location that is mapped to an I/O device reads a data byte or word from the device
. Storing to a memory location that is mapped to an I/O device writes a data byte or word to the device
. Used in the Motorola 68000 series
In order to synchronize I/O properly, additional memory locations may be mapped to status words for the I/O devices

SPIM’s MEMORY-MAPPED I/O REGISTERS

[Figure: SPIM’s memory-mapped I/O registers —
Receiver control (0xffff0000): bit 1 = interrupt enable, bit 0 = ready; remaining bits unused
Receiver data (0xffff0004): low 8 bits = received byte
Transmitter control (0xffff0008): bit 1 = interrupt enable, bit 0 = ready; remaining bits unused
Transmitter data (0xffff000c): low 8 bits = transmitted byte]

NETWORK INTERFACE CARD

[Figure: network interface card — an Ethernet controller (framing, bus interface) connects to the jack through a communication interface adapter (signaling: TCLK, TE, TXD, CD, RXD, COL) and to the system through a bus interface]

MC68360 COMMUNICATIONS CONTROLLER

[Figure: MC68360 block diagram — a SIM60 system integration module (clock generation, system protection, periodic timer, breakpoint logic, JTAG, DRAM controller and chip selects, external bus interface) and a CPM communications processor module (RISC communications processor, 2.5-KByte dual-port RAM, IDMA and general-purpose DMA channels, interrupt controller, timers, serial channels, time-slot assigner) around a CPU32+ core, joined by the 32-bit IMB]

LUCENT LXT905 ETHERNET SERIAL INTERFACE

LXT905 Block Diagram

[Figure: LXT905 block diagram — mode-select and loopback control logic; a Manchester encoder with CMOS filter, transmit amplifier, and pulse shaper driving TPOP/TPON; a receive squelch/slicer and Manchester decoder with collision, link, and polarity detect/correct on TPIP/TPIN; link-test and watch-dog timers; clock oscillator (CLKI/CLKO); LED drivers (LEDL, LEDR, LEDT/PDN, LEDC/FDE); and the controller-side signals TCLK, TXD, TEN, RCLK, RXD, CD, COL]

I/O PROCESSORS

An I/O processor (IOP) is a processor with (usually) a more restricted instruction set than the CPU
. Purpose: Offload I/O processing from the CPU
  - Used in the CDC 6600, IBM S/360–370, ...
. I/O instructions executed by an IOP are called channel command words in the IBM world
. A CPU and its IOPs are really a shared-memory multiprocessor


RELATION OF I/O TO PROCESSOR ARCHITECTURE

I/O instructions and buses have disappeared
Interrupt vectors have been replaced by jump tables
Interrupt stack replaced by shadow registers
. Handler saves registers and re-enables higher-priority interrupts
Interrupt types reduced in number
. Handler must query the interrupt controller
Caches cause problems for I/O
. Flushing degrades performance heavily
. Solution: “snooping” (borrowed from shared-memory multiprocessors)
Virtual memory frustrates DMA
Load-store architecture inconsistent with atomic I/O operations
Stateful processors hard to context switch

PARALLEL PROCESSING

Numerical problems and algorithms
Impact of Amdahl’s law
Parallel computer architectures
. Memory architectures
. Interconnection networks
Flynn’s taxonomy (SIMD, MIMD, etc.)
Shared-memory multiprocessors
Distributed-memory, message-passing multicomputers


COMPUTATIONAL PROBLEMS

Numerical simulation has become as important as experimentation
. Permits exploration of a much larger region in parameter space
. All-numerical designs
The most CPU-intensive application areas include:
. Computational fluid dynamics
. Computational electromagnetics
. Molecular modeling
Typical problem sizes (in double-precision floating-point operations) and execution times (in CPU-days at 10^9 FLOPS):

Problem size   Execution time (CPU-days)
10^14          1.16
10^15          11.57
10^16          115.7
10^17          1157
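The execution-time figures follow directly from the assumed 10^9 FLOPS rate; a quick check (the helper name is ours):

```python
# A problem of S double-precision floating-point operations on a machine
# sustaining `rate` FLOPS takes S / rate seconds, i.e.
# S / (rate * 86400) CPU-days.

def cpu_days(flops, rate=1e9):
    return flops / rate / 86400.0   # 86400 seconds per day

for size in (1e14, 1e15, 1e16, 1e17):
    print(f"{size:.0e} FLOPs -> {cpu_days(size):7.2f} CPU-days")
```

Running this reproduces the table: 10^14 operations take 10^5 seconds, about 1.16 days.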

NUMERICAL ALGORITHMS

Most of the important CPU-intensive applications involve the solution of a system of coupled partial differential equations
Initial-value problems
. Time-dependent wave propagation (electromagnetics, geophysics, optical networking)
. Time-dependent dynamics (fluid flow, mechanical engineering)
Boundary-value problems
. Typically require solution of very large systems of linear equations
. Quasi-static electromagnetic problems
. Static mechanical engineering (finite element analysis)
. Eigenvalue problems (molecular energy surfaces)


PROPAGATION EQUATIONS FOR 4-WAVE MIXING

Paraxial wave equation:

[Equation: propagation equation for the four-wave-mixing envelopes F_n — the z-derivative of F_n is driven by second- and third-order dispersion terms (∂²F_n/∂t², ∂³F_n/∂t³) and by χ(3) coupling terms proportional to D_klm F_k F_l F_m e^{iΔβ(nklm)z}]

Two regimes to study:
. Three strong waves (F_k, F_l, F_m) generate nine weak waves (F_n)
  - Useful for estimating crosstalk among channels
. Parametric coupling of three strong waves
  - Leads to coherent amplification of some channels and de-amplification of others


PULSE PROPAGATION IN AN OPTICAL FIBER (1)

[3-D plot: power (mW) vs. frequency and propagation distance (km)]

c D. M. Hollenbeck and C. D. Cantrell (05/1999)

PULSE PROPAGATION IN AN OPTICAL FIBER (2)

[3-D plot: power (mW) vs. time (ps) and propagation distance (km)]

[Figure: a centralized, shared-memory multiprocessor — several processors, each with one or more levels of cache, share a main memory and I/O system over a common bus]

Layers where shared memory can be implemented

[Figure: layers where shared memory can be implemented — two machines each stack application / language run-time system / operating system / hardware; shared memory can be provided at the hardware, operating-system, or run-time-system layer]

[Figure: a distributed-memory multicomputer — nodes, each a processor + cache with local memory and I/O, connected by an interconnection network]

Interconnection topologies

[Figure: interconnection topologies — star, full interconnection, tree, ring, grid, double torus, cube, 4-D hypercube]

[Figure: an interconnection network consisting of a 4-switch square grid — a packet from CPU 1 winds through four-port switches (ports A, B, C, D) toward CPU 2; the front, middle, and end of the packet span several switches]

[Figure: deadlock in a circuit-switched interconnection network — four CPUs simultaneously hold input ports and wait for output buffers held by one another, so no packet can advance]

Two-Hypernode Convex Exemplar System

[Figure: functional blocks of a two-hypernode Convex Exemplar system — hypernode 0 and hypernode 1, each with an I/O interface, connected by the CTI ring]

Typical access times to retrieve a word from a remote memory

Machine           Communication    Interconnection     Processor  Typical remote
                  mechanism        network             count      memory access time
SPARCCenter       Shared memory    Bus                 ≤ 20       1 µs
SGI Challenge     Shared memory    Bus                 ≤ 36       1 µs
Cray T3D          Shared memory    3D torus            32–2048    1 µs
Convex Exemplar   Shared memory    Crossbar + ring     8–64       2 µs
KSR-1             Shared memory    Hierarchical ring   32–256     2–6 µs
CM-5              Message passing  Fat tree            32–1024    10 µs
Intel Paragon     Message passing  2D mesh             32–2048    10–30 µs
IBM SP-2          Message passing  Multistage switch   2–512      30–100 µs

Scaling of computation, of communication, and of the computation-to-communication ratio (problem size n, processor count p):

Application  Computation    Communication            Computation-to-communication
FFT          (n log n)/p    n/p                      log n
LU           n/p            √n/√p                    √n/√p
Barnes       (n log n)/p    approx. (√n log n)/p     approx. √n
Ocean        n/p            √n/√p                    √n/√p

Scaling of computation, of communication, and of the ratio are critical factors in determining performance on parallel machines

Scalable vs. non-scalable systems

[Figure: Amdahl’s law for parallel processing — a program with inherently sequential fraction f and potentially parallelizable fraction 1 − f takes time T on one CPU; with n CPUs the parallelizable part shrinks to (1 − f)T/n, so the total time is fT + (1 − f)T/n]

AMDAHL’S LAW FOR PARALLEL COMPUTING

[Surface plot: efficiency (0 to 1) as a function of the parallelizable fraction f (0.2 to 1.0) and log2 n, where n = number of processors (up to 2^10) — efficiency collapses as the processor count grows unless f is very close to 1]
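Amdahl's law for parallel processing, with f the inherently sequential fraction, can be checked numerically (function names are ours):

```python
# Amdahl's law: with sequential fraction f and n CPUs, run time drops
# from T to f*T + (1 - f)*T/n, so
#   speedup    = 1 / (f + (1 - f)/n)
#   efficiency = speedup / n

def speedup(f, n):
    """f = inherently sequential fraction, n = number of processors."""
    return 1.0 / (f + (1.0 - f) / n)

def efficiency(f, n):
    return speedup(f, n) / n

# Even a 5% sequential part caps the speedup near 20x:
print(speedup(0.05, 1024))      # about 19.6
print(efficiency(0.05, 1024))   # under 2% efficiency
```

This is exactly why the efficiency surface collapses for large processor counts: as n grows, the speedup saturates at 1/f while n keeps increasing.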

THREADS

Threads are the software equivalent of hardware functional units
Properties of threads:
. Exist in user process space or kernel space
  - User threads are faster to launch than processes
. Mappings required for execution: user thread → lightweight process → kernel thread → CPU
Reference: Bil Lewis and Daniel J. Berg, Threads Primer (SunSoft/Prentice Hall, 1996)

Computational paradigms

[Figure: computational paradigms — a work queue served by several processes; a process pipeline; phase-parallel execution separated by synchronization points; and a task graph of processes P1–P9]

[Figure: a bus-based snooping multiprocessor — each processor’s cache holds snoop tags alongside the ordinary cache tags and data; processors, memory, and I/O share a single bus]

[Figure (a): cache state transitions using signals from the processor — states Invalid (not a valid cache block), Read Only (clean), and Read/Write (dirty); a processor read miss takes Invalid to Read Only; a processor write (hit or miss) takes the block to Read/Write, sending an invalidate on a write hit; a read or write miss to a dirty block writes the dirty block back to memory first]

[Figure (b): cache state transitions using signals from the bus — an invalidate or another processor’s write miss for this block (seen on the bus) returns it to Invalid; another processor’s read miss or write miss for a dirty block (seen on the bus) forces write-back of the old block]

The MESI cache coherence protocol

[Figure: the MESI protocol in action — (a) CPU 1 reads block A: the only cached copy is Exclusive; (b) CPU 2 reads block A: both copies become Shared; (c) CPU 2 writes block A: its copy becomes Modified; (d) CPU 3 reads block A: the modified copy is supplied and the copies become Shared; (e) CPU 2 writes block A: its copy becomes Modified again; (f) CPU 1 writes block A: CPU 1’s copy becomes Modified]

[Flowchart: acquiring a spin lock with an atomic swap — load the lock variable; if it is not 0 (unlocked), retry; otherwise try to lock it with a swap that reads the lock variable and sets it to the locked value (1); if the value read was not 0, retry; otherwise begin and finish the update of the shared data, then unlock by setting the lock variable to 0]

Flynn’s taxonomy of parallel computers

Instruction streams  Data streams  Name  Examples
1                    1             SISD  Classical von Neumann machine
1                    Multiple      SIMD  Vector supercomputer, array processor
Multiple             1             MISD  Arguably none
Multiple             Multiple      MIMD  Multiprocessor, multicomputer

A taxonomy of parallel computers

Parallel computer architectures

[Figure: a taxonomy of parallel computer architectures — SISD (von Neumann); SIMD, divided into vector processors and array processors; MISD (?); MIMD, divided into shared-memory multiprocessors (UMA: bus or switched; COMA; NUMA: CC-NUMA or NC-NUMA) and message-passing multicomputers (MPP: grid or hypercube; COW)]

CRAY-1 registers and functional units

[Figure: CRAY-1 registers and functional units — eight 24-bit address (A) registers backed by sixty-four 24-bit holding (B) registers for addresses; eight 64-bit scalar (S) registers backed by sixty-four 64-bit holding (T) registers for scalars; eight vector (V) registers of 64 elements each; address units (add, multiply), scalar/integer units (add, boolean, shift, population count), and scalar/vector floating-point and vector integer units (add, boolean, shift, multiply, reciprocal)]

COMBINED VECTOR AND SCALAR OPERATIONS

SAXPY (Scalar A times X Plus Y): z = a·x + y, with a a scalar and x, y vectors
GAXPY (General A times X Plus Y): z = A·x + y, with A a matrix
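The two kernels can be written out in plain Python for reference (function names are ours; production code would use a tuned BLAS):

```python
# SAXPY: z = a*x + y, with scalar a and vectors x, y.
def saxpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

# GAXPY: z = A*x + y, with matrix A:  z_i = sum_j A[i][j]*x[j] + y[i].
def gaxpy(A, x, y):
    return [sum(aij * xj for aij, xj in zip(row, x)) + yi
            for row, yi in zip(A, y)]

print(saxpy(2.0, [1.0, 2.0], [10.0, 20.0]))                      # [12.0, 24.0]
print(gaxpy([[1.0, 0.0], [0.0, 1.0]], [3.0, 4.0], [1.0, 1.0]))   # [4.0, 5.0]
```

On a vector machine, SAXPY maps onto one chained vector multiply-add; GAXPY is a loop of SAXPYs, one per column or row of A.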


VECTORIZATION INHIBITORS

Loops containing
. Dependences involving an array
. Recursion
. I/O statements
. Function or subroutine calls
. References to a nonvectorizable statement function
. A limit or stride that varies with respect to an outer loop
. Multiple entrances or exits
. Equivalenced variables
. Computed or assigned GO TO statements
. Vector reduction operations (on some vector computers)
. Nonlinear array references to memory
. Ambiguous subscripts


IMPORTANCE OF VECTOR LENGTH

Suppose that the vector length is l_v, i.e., l_v elements can be held in a hardware vector register. If we need s cycles to set up a vector pipeline, then we need s + n cycles to perform n operations if n ≤ l_v.
If n > l_v, we need n + s(⌊n/l_v⌋ + 1) cycles to perform n operations. This reduces the rate of computation by the factor

    L(n) = 1 / (1 + (s/n)(⌊n/l_v⌋ + 1)).

The vector length n_1/2, such that

    L(n_1/2) = 1/2,

measures the startup overhead. A computer with a large value of n_1/2 performs poorly in calculations on short vectors.
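These timing formulas are easy to evaluate numerically; a sketch (function names and the brute-force search are ours):

```python
import math

# Cycle count for n vector operations with setup cost s and register
# length l_v, and the resulting rate factor L(n) relative to the ideal
# of one operation per cycle.

def cycles(n, s, lv):
    if n <= lv:
        return s + n
    return n + s * (math.floor(n / lv) + 1)

def L(n, s, lv):
    return n / cycles(n, s, lv)

def n_half(s, lv):
    """Smallest vector length at which half the peak rate is achieved."""
    n = 1
    while L(n, s, lv) < 0.5:
        n += 1
    return n

print(cycles(100, s=8, lv=64))   # 100 + 8*(1+1) = 116
print(n_half(8, 64))             # 8: for n <= l_v, L(n) >= 1/2 once n >= s
```

For short vectors (n ≤ l_v) the condition L(n) ≥ 1/2 reduces to n ≥ s, so n_1/2 is simply the pipeline setup cost, which is why machines with long startup times do poorly on short vectors.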


VECTOR OPTIMIZATION TECHNIQUES (1)

Design the algorithm to yield an optimized, vectorizable innermost DO loop:
. Avoid nonvectorizable constructions
. Avoid scalar temporary variables
. Use unit-stride memory references
. Unroll loops if there is only one pipe to memory
. Distribute loops
. Put longest vectors in the innermost loop
. Reorder loops or statements to avoid recurrence on the innermost loop
Use parallel operations or chaining to perform as much computation as possible in each vector period or cycle
Use strip mining to minimize vector load overhead
Minimize subroutine or function call overhead (several 10^2 CPU cycles per call)

VECTOR OPTIMIZATION TECHNIQUES (2)

Avoid scalar temporary variables in a vector loop. Scalar temporaries force unnecessary scalar stores to memory, and may inhibit chaining on CRAYs.
. Slow matrix multiplication inner loop:

      SUM = 0.
      DO 100 K = 1, N
        SUM = SUM + A(J, K) * B(K, I)
  100 CONTINUE
      C(I, J) = SUM

. Faster loop:

      C(I, J) = 0.
      DO 100 K = 1, N
        C(I, J) = C(I, J) + A(J, K) * B(K, I)
  100 CONTINUE

. Some processors vectorize reduction operations such as the above dot product

VECTOR OPTIMIZATION TECHNIQUES (3)

Use unit-stride memory references in the innermost loop to take maximum advantage of bank memory (if present), or if they will permit cache bypass:
. Non-unit-stride innermost loop:

      DO 20 I = 1, M
        DO 20 J = 1, N
          A(I, J) = B(I, J) * C(I, J)
   20 CONTINUE

. Interchange loops to obtain unit stride:

      DO 20 J = 1, N
        DO 20 I = 1, M
          A(I, J) = B(I, J) * C(I, J)
   20 CONTINUE

. Some compilers perform this optimization


VECTOR OPTIMIZATION TECHNIQUES (4)

On a system with interleaved memory, avoid using strides through memory that have common factors with the number of memory banks
. Original loop:

      DIMENSION A(32, N), B(32, N)
      DO 100 I = 1, 32
        DO 100 J = 1, N
          A(I, J) = A(I, J) + B(I, J)
  100 CONTINUE

. The inner loop has stride 32, which causes a bank conflict at every memory access on a machine with 32 banks (e.g., CRAY X-MPs). Interchanging the loops gives unit stride:

      DIMENSION A(32, N), B(32, N)
      DO 100 J = 1, N
        DO 100 I = 1, 32
          A(I, J) = A(I, J) + B(I, J)
  100 CONTINUE


VECTOR OPTIMIZATION TECHNIQUES (5)

Unroll an outer loop if the processor has only one vector pipeline between memory and the CPU (e.g., the CRAY-1)
. Unrolling can decrease conflicts between loads and stores
. Dongarra and Eisenstat’s SAXPY for y = Ax:

      DO 20 J = 1, N
        DO 10 I = 1, M
          Y(I) = Y(I) + X(J) * A(I, J)
   10   CONTINUE
   20 CONTINUE

. Each instance of the inner loop requires 2 vector loads, a (possibly chained) multiply-add, and a vector store, or a total of 3 memory references for 2 FLOPs


VECTOR OPTIMIZATION TECHNIQUES (6)

Unroll the outer loop of Dongarra and Eisenstat’s SAXPY to a depth of four:

      DO 20 J = 1, N, 4
        DO 10 I = 1, M
          Y(I) = (((Y(I) + X(J-3) * M(I, J-3))
     *       + X(J-2) * M(I, J-2))
     *       + X(J-1) * M(I, J-1))
     *       + X(J) * M(I, J)
   10   CONTINUE
   20 CONTINUE

Now each instance of the inner loop makes 6 memory references for 8 FLOPs, so that on a single-vector-pipe processor the speed is twice that of the original loop


VECTOR OPTIMIZATION TECHNIQUES (7)

Distribute a DO loop for partial avoidance of recursion
. Original loop:

      DO 20 I = 1, M
        A(I) = A(I-1) + B(I) * C(I)
   20 CONTINUE

. Distributed loop:

      DO 10 I = 1, M
        T(I) = B(I) * C(I)
   10 CONTINUE
      DO 20 I = 1, M
        A(I) = A(I-1) + T(I)
   20 CONTINUE

. Some compilers perform such partial vectorizations automatically


VECTOR OPTIMIZATION TECHNIQUES (8)

Distribute nested DO loops to enhance vectorization
. Original loops:

      DO 20 I = 1, M
        DO 10 J = 1, N
          A(I) = A(I) + B(I, J) * C(I, J)
   10   CONTINUE
        D(I) = E(I) + A(I)
   20 CONTINUE

. Distributed and reordered loops:

      DO 20 J = 1, N
        DO 10 I = 1, M
          A(I) = A(I) + B(I, J) * C(I, J)
   10   CONTINUE
   20 CONTINUE
      DO 30 I = 1, M
        D(I) = E(I) + A(I)
   30 CONTINUE

VECTOR OPTIMIZATION TECHNIQUES (9)

Reorder loops to avoid multidimensional recursion
. Original loops:

      DO 100 I = 1, M
        DO 100 J = 1, N
          A(I, J) = A(I, J-1) * B(I, J)
  100 CONTINUE

. Reordered loops:

      DO 100 J = 1, N
        DO 100 I = 1, M
          A(I, J) = A(I, J-1) * B(I, J)
  100 CONTINUE


VECTOR OPTIMIZATION TECHNIQUES (10)

Use strip mining to minimize vector load overhead
. Original loop:

      DO 100 I = 1, M
        A(I) = B(I) * C(I)
  100 CONTINUE

. Strip-mined loop (assuming a processor with vector registers of length 128):

      J = 0
      DO 110 K = M, 0, -128
        DO 100 I = 1, 128
          A(I+J) = B(I+J) * C(I+J)
  100   CONTINUE
        J = J + 128
  110 CONTINUE

. If the upper limit M is not a multiple of 128, there will be some overhead in cleaning up the last part of the loop.

Strip Mining

      DO I = 1, 10000
        A(I) = A(I) * B(I)
      ENDDO

Becomes...

      DO IOUTER = 1, 10000, 1000
        DO ISTRIP = IOUTER, IOUTER+999
          A(ISTRIP) = A(ISTRIP) * B(ISTRIP)
        ENDDO
      ENDDO
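The same transformation can be mimicked in Python (function name is ours); unlike the Fortran example, this sketch also handles a final partial strip:

```python
# Strip mining: process the index space in fixed-length strips, as a
# vector register of length `strip` would force. The inner loop always
# fits in one strip; min() handles a partial final strip.

def strip_mined_multiply(a, b, strip=128):
    n = len(a)
    out = [0.0] * n
    for start in range(0, n, strip):    # outer loop steps strip by strip
        stop = min(start + strip, n)    # cleanup for a partial last strip
        for i in range(start, stop):    # inner loop covers one strip
            out[i] = a[i] * b[i]
    return out

a = [float(i) for i in range(300)]
b = [2.0] * 300
assert strip_mined_multiply(a, b) == [2.0 * i for i in range(300)]
```

In Python this buys nothing by itself; on a vector machine the inner loop becomes one vector load/multiply/store per strip, which is the point of the transformation.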

Loop Interchange

      DO I = 1, N
        DO J = 1, N
          A(I,J) = B(I,J) + C(I,J)
        ENDDO
      ENDDO

Becomes...

      DO J = 1, N
        DO I = 1, N
          A(I,J) = B(I,J) + C(I,J)
        ENDDO
      ENDDO
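Both loop orders compute the same array; only the order of memory accesses differs. In Fortran's column-major layout the interchanged version (inner loop over I) walks memory with unit stride. A sketch of the equivalence (function names are ours):

```python
# Loop interchange preserves the result; the two versions differ only
# in traversal order, which matters for stride on a real memory system.

def add_ij(B, C):
    # inner loop over J (non-unit stride in Fortran's column-major layout)
    n = len(B)
    return [[B[i][j] + C[i][j] for j in range(n)] for i in range(n)]

def add_ji(B, C):
    # inner loop over I (unit stride in Fortran's column-major layout)
    n = len(B)
    A = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            A[i][j] = B[i][j] + C[i][j]
    return A

B = [[1, 2], [3, 4]]
C = [[10, 20], [30, 40]]
assert add_ij(B, C) == add_ji(B, C) == [[11, 22], [33, 44]]
```

The interchange is legal here because no iteration reads a value written by another iteration; with a loop-carried dependence the transformation would need the dependence analysis discussed in the next slides.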

ARRAY DEPENDENCIES (1)

Reference to a value calculated in a previous instance of the loop
. Nonvectorizable construction:

      DO 100 I = 2, 100
        B(I) = A(I-1)
        A(I) = C(I)
  100 CONTINUE

. If all of the elements of A, B and C were operated on in groups, then execution of B(3) = A(2) would occur before execution of A(2) = C(2).
. Vectorization is possible after interchanging the statements:

      DO 100 I = 2, 100
        A(I) = C(I)
        B(I) = A(I-1)
  100 CONTINUE

. Some vectorizing FORTRAN compilers make such statement interchanges automatically

ARRAY DEPENDENCIES (2)

Reference to a value that should be calculated in a later instance of the loop
. Nonvectorizable construction:

      DO 100 I = 1, 99
        A(I) = C(I)
        B(I) = A(I+1)
  100 CONTINUE

. If all of the elements of A, B and C were operated on in groups, then execution of A(2) = C(2) would occur before execution of B(1) = A(2).
. Vectorization is possible after interchanging the statements:

      DO 100 I = 1, 99
        B(I) = A(I+1)
        A(I) = C(I)
  100 CONTINUE

. Some vectorizing FORTRAN compilers make such statement interchanges automatically