Curriculum Builder Portal 2/11/10 8:54 AM

1 Introductory activities (2009-1-20 ~ 2009-1-21, 7 activities)
1.1 Welcome to CS 61CL! (3 steps)
1.1.1 (Display page) CS 61C vs. CS 61AB

The CS 61 series is an introduction to computer science, with particular emphasis on software and on machines from a programmer's point of view. The first two courses considered programming at a high level of abstraction, introducing a range of programming paradigms and common techniques. This course, the last in the series, concentrates on machines and how they carry out the programs you write. The main topics of CS 61CL involve the low-level system software and the hardware organization of a "logical machine"—not the actual electronic circuits, but the computational operations that those circuits carry out. To make these ideas concrete, you will study the structure of a particular computer, the MIPS R3000 processor, in some detail, down to the level of the design of the processor's on-chip components.

1.1.2 (Display page) Course outline

Topics in the first half of CS 61CL will be covered roughly in the sequence specified below.

week  topics
1     introduction to C
2     C arrays and pointers
3     dynamic storage allocation
4     introduction to assembly language
5     assembly language translation of C operations
6     machine language
7     floating-point representations

Topics to be covered in the remainder of the course include the following:

- input and output
- linking and loading
- logic design
- CPU design
- C memory management
- virtual memory
- caches

1.1.3 (Display page) A notebook for CS 61CL

We encourage you to buy a notebook specifically for CS 61CL and bring it to lab every day. TAs will want a place to write answers to questions you ask. You will want a place to keep track of issues or techniques that arise in lab. The notebook is also a good place to record errors you make, to help you avoid making them again.

1.2 Get oriented to the UC-WISE system. (1 step)
1.2.1 (Display page) Overview

For those of you who are new to the UC-WISE learning environment, this document provides an overview.

1.3 Meet the C language. (6 steps)
1.3.1 (Display page) Why C?

C arose from work on UNIX at Bell Labs in the late 1960s. It filled two main needs. First, it is a relatively high-level language that nonetheless allows flexible low-level access to memory and operating system resources. Second, it is relatively easy to write a compiler for, so it wasn't long before implementations of C were available on a variety of computers. (This contributed greatly to the spread of UNIX.) In CS 61CL, we'll focus on two aspects of C. First, there are the things you need to know to be an everyday C programmer; you'll write some of the CS 61CL projects in C, so we hope these hints will help you with those programs. Also, C allows easy access to other aspects of the CS 61CL content.

1.3.2 (Display page) C vs. Java

To a first approximation, C is just Java without the object-oriented features. Here's a Java program that veterans of CS 61B have probably encountered.

    public class Account {

        public Account (int amount) {
            myAmount = amount;
        }

        public void deposit (int amount) {
            myAmount = myAmount + amount;
        }

        public int balance ( ) {
            return myAmount;
        }

        private int myAmount;
    }

Crossing out mentions of class and public/private leaves an almost legal C program:
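As an illustrative sketch (not the course's own code), the crossed-out Account example might come out as follows in C. The constructor has no direct C analogue, so the hypothetical function init stands in for it, and myAmount becomes a file-level variable:

```c
/* The Account example with class and public/private crossed out:
   the methods become ordinary functions, and myAmount becomes
   a file-level variable shared by them. */
int myAmount;

void init (int amount) {      /* stands in for the constructor */
    myAmount = amount;
}

void deposit (int amount) {
    myAmount = myAmount + amount;
}

int balance ( ) {
    return myAmount;
}
```

Note that there is only one account here; C without structs gives us no way to stamp out multiple independent instances, which is part of what the object-oriented features buy.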

1.3.3 (Display page) Format of a C program

Here's an excerpt of a program we'll see later in this lab session, along with some explanatory notes.

    #include <stdio.h>

    int main ( ) {
        unsigned int exp = 1;
        int k;

        /* Compute 2 to the 31st. */
        for (k=0; k<31; k++) {
            exp = exp * 2;
        }
        ...
        return 0;
    }

Notes:

- #include is loosely similar to the import construct in Java. Mechanically, the C compiler replaces the #include line by the contents of the named file. Well-designed include files are an important aspect of good C programming style. Important system programming interfaces are described in < > include files. (Surrounding the file name with angle brackets, as in the example, tells the compiler to look for the file in a set of system directories.) User-defined data types and interfaces, such as you will write, are described in " " include files. (Surrounding the file name with double quote marks says to look in the user's directory for the file.)
- As in Java, main is called by the operating system to run the program. Unlike in Java, main is an int-valued function that returns a success-failure code (0 means success, anything else means failure). Command-line arguments may, but are not required to, be specified as arguments to main. The C compiler also allows main to be declared as void; K&R contains numerous examples of this.
- C requires declaration before use; moreover, all declarations must appear right after the left brace that opens a group of statements.
- C uses "/*" and "*/" as comment delimiters, as does Java. More recent versions of C also allow "//" comments as in Java.

1.3.4 (Display page) Another example

Here's another example that we'll revisit later in this lab session.

#include <stdio.h>

int bitCount (unsigned int n);

    int main ( ) {
        printf ("%d %d", 0, bitCount (0));
        printf ("%d %d", 1, bitCount (1));
        printf ("%d %d", 27, bitCount (27));
        return 0;
    }

    int bitCount (unsigned int n) {
        int k;
        return 13;
    }

Notes:

Functions should also be declared before they're used. In C, that's done with a function prototype; the line right after the #include is an example. Files whose names end in ".h" (for "header") typically consist of function prototypes and data type definitions. Files whose names end in ".c" contain the code for those functions. The printf function is used to produce output. Its first argument is a format string. Each occurrence of "%" in the format string indicates where one of the subsequent arguments will appear and in what form. This association between form and variable type isn't checked by the C compiler, as we'll note later.

1.3.5 (Display page) Some troublesome details

C doesn't have a boolean type. It uses int instead: 0 means "false", anything else means "true". (This is similar to Scheme.) This feature, combined with assignment within expressions and a looser approach to types in general, leads to a common mistake made by beginning C programmers: using "=" instead of "==" in a comparison causes C to interpret the comparison as an assignment statement, whose value is the value assigned. Thus, a typical assignment statement might be:

    n = 5;

and a typical conditional might be:

    if (n == 5) { ... do something ... }

But the following is a funny combination of both:

    if (n = 5) { ... do something ... }

It assigns the value 5 to the variable n and tests whether 5 is not equal to zero. (It's true: 5 is not 0.) The test always succeeds, which is probably not what the programmer intended. In Java, this causes a compile-time error, since the condition in an if statement must be of type boolean. And yes, C programmers do find elegant uses of the assign-and-test idiom, but in moderation; for new C programmers, it is usually a bug. C doesn't check for the failure to initialize variables, or for accesses outside the bounds of an array. These omissions have been the bane of many a C programmer over the years. They are best addressed by good, disciplined programming conventions right from the beginning. Although simple, C is very powerful and allows you to get at anything the machine can do. Without care, that power can be dangerous.

1.3.6 (Brainstorm) C vs. you
Are these aspects of C—allowing assignments in comparisons, not checking that variables are initialized, and not checking for out-of-range accesses to an array—likely to prove troublesome for you? If not, why not? If so, what can you do to help yourself avoid these errors?

1.4 Explore the programming environment. (6 steps)
1.4.1 (Display page) Review of some UNIX commands

First, use the mkdir command to create a directory named lab1 in your home directory. The mkdir command takes a single argument, the name of the directory to create. Then use the cp command to copy two files from the CS 61CL code directory to the lab1 directory. The cp command takes two arguments:

- the file to be copied, and
- the destination file or directory.

The files to be copied are ~cs61cl/code/wc1.c and ~cs61cl/code/wc.data.txt. Incidentally, if you're new to UNIX, this web page contains short descriptions of some more useful UNIX commands. The most important is man. If you are not sure how a UNIX command works, run man on it. (Yes, man man works too.)

1.4.2 (Display page) UNIX i/o redirection

The default input source for many UNIX programs is the keyboard. However, this source may be redirected to come from a file, using the < operator. For example, a desk calculator program named dc simulates a calculator that takes input in postfix order (operands before operator). When one runs dc with the command

    dc

input comes from the keyboard. For example, to compute 2 + 3, one would type

    2 3 + p

("p" prints the result.) To take input instead from a file named dc.cmds, we would type

    dc < dc.cmds

1.4.3 (Display page) gcc, gdb, and emacs

In lab, we'll be using the gcc C compiler and the gdb debugger. The emacs editor provides features that enhance the capability of gdb, with which you'll shortly get some practice. Check right now that the gcc command is set up properly for your account, by typing the alias command to the UNIX shell. Among the entries printed should be one for gcc that includes the options "-Wall", "-g", and "-std=c99". "-Wall" means "warn about all recognized irregularities"; "-g" causes gcc to store enough information in the executable program for gdb to make sense of it; "-std=c99" enables the use of features added to C in the 1999 standard. Warn your instructor if these options do not appear in an alias entry for gcc. Let's try it out. Type gcc. It should come back with gcc: no input files. That's right; you didn't specify any. If it comes back with command not found, you have a problem to sort out. Run it on the input file that you just copied over:

    gcc wc1.c

Where did the output go? It should have created an executable file—your program. What UNIX command do you use to find this file? How do you run it? What does it do? Ask a labmate if you're not sure of the answers to these questions. gdb is another executable. You can run your program under gdb by typing gdb nameOfAnExecutableFile. To write your own programs, you'll need an editor. If you're unfamiliar with emacs, or you've forgotten what you learned about it in past CS classes, there should be a "quick reference" to emacs document somewhere near your workstation. This document is also available online here. Don't hurry through it. Knowing the tools well will save you lots of time later. All three of these tools come together to make an integrated programming environment. Another document that's available in lab is "An Introduction to Using GDB Under Emacs". Take a few minutes to read this document.

1.4.4 (Display page) A buggy program

The program below, which you should have copied to your lab1 directory, is intended to mimic the behavior of the wc (word count) program supplied as part of the standard UNIX distribution. Giving the command

    wc -w < fileName

prints the number of words—essentially sequences of non-space characters—in the named file. For example, if a file named data contains the two lines

    a bc def
      ghi j

then the command

    wc -w < data

should produce 5 as output. (Type man wc for more information.) Here's the buggy program.

    #include <stdio.h>
    #include <ctype.h>

    int main ( ) {
        int wc;
        char c;
        for (wc=1; ; wc++) {
            /* We're about to encounter the wc'th word.
               Skip leading white space. */
            while ((c = getchar ( )) && c != EOF && isspace (c)) {
            }
            /* Read through characters of the word. */
            while ((c = getchar ( )) && c != EOF && !isspace (c)) {
            }
        }
        printf ("%d\n", wc);
        return 0;
    }

A few notes:

This program processes character data rather than integers. The getchar function returns the next character from the input source. If no more characters remain, getchar returns the special value EOF. The isspace function is part of the ctype (short for "character type") library and is accessed via the line

    #include <ctype.h>

near the top of the program.

We hope that nothing else in the program is mysterious. If you are unsure of some aspect of the program, ask a labmate.

1.4.5 (Display page) Using the gdb debugger

First, run the wc1 program on the sample data file:

    gcc -Wall wc1.c
    a.out < wc.data.txt

The program has an infinite loop. Type control-C to regain control. To help find out what's wrong with the program, do the following:

1. Compile the program and run it with gdb as described in the "An Introduction to Using GDB Under Emacs" document. This will involve splitting your emacs window in half, then running gdb in the bottom half. Here are the details of doing this.
   a. Edit the wc1.c file with emacs.
   b. Split the window using control-x 2.
   c. In the code window, type meta-x compile; backspace over the make command, and replace it with gcc -Wall -g wc1.c.
   d. In the other window, where you want to run gdb, type meta-x gdb, then complete the command by typing a.out. ("a.out" is the default name of the executable file produced by gcc.)
   e. Move the cursor to the gdb window, click on the command prompt, and type run < wc.data.txt. Observe the program's behavior. Type control-c to regain control.
   f. Go back to the code window, and continue with step #2 below.

2. Click on the line after the for loop. Set a breakpoint at that line using control-x space, then run the program again in gdb. Type the continue command (you can use "c" to abbreviate it) four times. Then use the next command to single-step through the program to determine what's going wrong. Identify and fix the error. Another way to set a breakpoint is to give the "b" command at the "(gdb)" prompt in one of the following two ways. (Use the "list" command to determine line numbers.)

       b function_name
       b line_number

3. Recompile and run the corrected program.

In the next step, explain the bug you found and how you fixed it.

1.4.6 (Brainstorm) Analyzing program behavior
What was wrong with wc1.c, how did you figure it out, and how did you fix it?

1.5 Consider the representation of int values. (9 steps)
1.5.1 (Display page) Representation of integers

We will now examine standard representations of nonnegative integers, to make clear the concept of a value "fitting in" a memory location. First, we distinguish between a value and the representation of that value, i.e. how it is communicated from one person to another. The integer value 27 may be represented using decimal digits, or Roman numerals (XXVII), or words ("twenty-seven"), or an arithmetic expression (6*3+9). We are accustomed to using the decimal system for representing numbers. In this system, we represent a value with a sequence of decimal digits, each being one of 0, ..., 9. The position of a digit in this sequence is important: 758 represents a different value from 857. In fact, each position is associated with a weight. The weights and the digits are combined arithmetically to form the represented value. Finally, it's a based number system, in that each weighting factor is a power of some base value (also referred to as the radix). In decimal, the base is 10. A decimal representation of an integer M is then a sequence of decimal digits

    d_{n-1} d_{n-2} ... d_1 d_0

such that

    M = d_{n-1} × 10^(n-1) + d_{n-2} × 10^(n-2) + ... + d_1 × 10^1 + d_0 × 10^0

For example, 758 represents 7 × 10^2 + 5 × 10^1 + 8 × 10^0. Inside a computer, it is more convenient to use a base that's a power of 2, for example:

    base 2 (binary): "digits" are 0 and 1
    base 8 (octal): "digits" are 0, ..., 7
    base 16 (hexadecimal): "digits" are 0, ..., 9, A (representing 10), B (representing 11), C (12), D (13), E (14), F (15)

For example, the value that 45 represents in base 10 is represented in binary as 101101, in octal as 55, and in hexadecimal as 2D, since

    45 base 10 = 1 × 2^5 + 0 × 2^4 + 1 × 2^3 + 1 × 2^2 + 0 × 2^1 + 1 × 2^0
               = 5 × 8^1 + 5 × 8^0
               = 2 × 16^1 + 13 × 16^0

1.5.2 (Self Test) Representing values in different bases
Give the decimal representation of the value represented in hexadecimal as A5.

Give the ternary (base 3) representation of the value from question 1.

1.5.3 (Display page) A joke that's part of the CS culture

Click on this link for a number representation joke.

1.5.4 (Display page) Conversion between bases that are powers of 2

Any representation using a base that's a power of 2 can easily be converted to binary, by writing out the individual digits in binary. For example:

    2B3 base 16 = 0010 1011 0011 base 2

To convert from binary to a base that's 2^a, just write the digits in groups of a and translate each group to its corresponding digit in base 2^a. For example, we can translate from binary to octal, base 2^3, as follows:

    001010110011 base 2 = 001 010 110 011 base 2 = 1263 base 8

1.5.5 (Display page) Why powers of 2 are interesting

For reasons that you would learn about in an electronics course (e.g. EE 40 or EE 42), computer memories are organized into collections of circuits that are either "on" or "off". For storage of numeric information, this naturally suggests the use of base 2, with the binary digit (bit) 1 representing "on" and 0 representing "off". Typical units of computer memory are a byte (8 bits; also used to represent a char in C), a word (32 bits; used to represent an int in C), a halfword (16 bits; used to represent a short int in C or a Unicode character), and a double word (64 bits). With n bits, you can represent 2^n different values. Thus with 8 bits, you can represent 256 = 2^8 different values; 10 bits lets you represent 1024 = 1 "K" = 2^10 different values. An int in C can take on only a limited number of different values. When we try to construct an integer value that's outside that range, and store it in an int, surprises may result.

1.5.6 (Self Test) Binary fingers
All your life you have been counting in unary (base 1) on your fingers, so even with two hands you can only count to ten. How high can you count on one hand in binary? Can you count from zero to that number quickly? (Watch out for the number 4.) How high can you count with two hands?

1.5.7 (Display page) Experimenting with conversion between bases

Given below is a program that, given a binary representation typed by the user, prints the decimal equivalent. It's in the file ~cs61cl/code/bin2dec.c. Copy the program to your lab1 directory. The program is missing one statement. Supply it, then test the program. It may help to work through how you would produce its output by hand, if given the digits of a binary representation left to right.

    #include <stdio.h>

    int main ( ) {
        int n;
        char c;
        while (1) {
            /* Translate a new line to decimal.  Assume that the only
               characters on the line are zeroes and ones. */
            n = 0;
            c = getchar ( );
            if (c == EOF) {
                break;
            }
            while (c == '1' || c == '0') {
                /* The missing statement goes here. */
                c = getchar ( );
            }
            printf ("in decimal is %d\n", n);
        }
        return 0;
    }

1.5.8 (Display page) More debugging

The program ~cs61cl/code/checkPower2.c declares and initializes a variable n, then checks whether its value is a power of 2. Copy the program to your lab1 directory and run it to verify that 32768 is a power of 2 (it's 2^15). Then test the program on several other values:

1. 65536 (a power of 2)
2. 1 (another power of 2)
3. 32767 (not a power of 2)
4. 3 (also not a power of 2)

Finally, test the program on the value 2147483649, which is 1 plus 2^31, the largest power of two that fits in an unsigned int. Single-step through the computation to find out what happens. In the next step, explain what happened and why.

1.5.9 (Brainstorm) Identifying the bug
What was the bug, and how did you find it?

1.6 Use program code to examine internal representations. (8 steps)
1.6.1 (Display page) Printing an int in base 2

The program below, also in ~cs61cl/code/buggy.base2.print.c, is intended to print the base 2 representation of the unsigned value stored in the variable numToPrintInBase2. It has bugs. Fix them.

    #include <stdio.h>

    int main ( ) {
        unsigned int numToPrintInBase2 = 2863311530; /* alternating 1's and 0's */
        unsigned int exp = 1;
        int k;
        /* Compute the highest storable power of 2 (2 to the 31st). */
        for (k=0; k<31; k++) {
            exp = exp * 2;
        }
        /* For each power of 2 from the highest to the lowest, print 1
           if it occurs in the number, 0 otherwise. */
        for (k=31; !(k=0); k--) {
            if (numToPrintInBase2 >= exp) {
                putchar ('1');
                numToPrintInBase2 = numToPrintInBase2 - exp;
            } else {
                putchar ('0');
            }
            exp = exp / 2;
        }
        putchar ('\n');
        return 0;
    }

1.6.2 (Brainstorm) Identifying the bug(s)
What were the bugs in buggy.base2.print.c, and how did you find them?

1.6.3 (Display page) Printing with gdb

Of course, you don't need to write your own programs to print the contents of memory. For example, gdb includes a general printing facility. The command

    print /representation variable

prints the value of the given variable, using the given representation. Among the choices for representation are the following:

    u   unsigned decimal integer
    x   hexadecimal
    o   octal
    t   binary

For example,

print /x n

would print the contents of n in hexadecimal. Experiment with this gdb feature, using the simple program below. Set a breakpoint at the return statement, then change values via the print feature you used in the last lab, for example:

    print n = 25022

Simple program:

    int main ( ) {
        int n = 70;
        return 0;
    }

1.6.4 (Display page) The char data type

C's char data type represents ASCII characters. (This table lists them all.) A char is treated by C as just a small integer. One may do arithmetic on char values. When an int is used where a char is expected, the excess leftmost bits are discarded. Use the online ASCII table to answer the questions in the next step.

1.6.5 (Self Test) Experimenting with characters
What gets printed by the following statement?

putchar (70);

What numerical argument (i.e. an argument without quotes) should be passed to putchar so that the program segment below prints a line containing a 0?

    putchar ( ___ );
    putchar ('\n');

1.6.6 (Display page) Using the od program

The od program interprets its input according to the command line arguments, and prints its interpretation. Recall the file wc.data.txt. Here are the results of running od on that file.

od -b < wc.data.txt (prints each byte in octal, base 8):
    0000000 141 040 142 143 040 144 145 146 012 040 040 147 150 151 040 152
    0000020 040 012
    0000022

od -c < wc.data.txt (prints each byte as an ASCII character):
    0000000   a       b   c       d   e   f  \n           g   h   i       j
    0000020       \n
    0000022

od -d < wc.data.txt (prints each pair of bytes in decimal):
    0000000 24864 25187 08292 25958 02592 08295 26729 08298
    0000020 08202
    0000022

od -x < wc.data.txt (prints each pair of bytes in hexadecimal, base 16):
    0000000 6120 6263 2064 6566 0a20 2067 6869 206a
    0000020 200a
    0000022


The seven-digit octal value at the start of each line gives the position of the byte in the file. man od will give you more information about how the program works. The program ~cs61cl/code/mysteryout apparently produces a blank line as output. Find out what it really prints using od. Use the UNIX "pipe" facility to send the output of mysteryout to od:

    mysteryout | od options

1.6.7 (Brainstorm) Characters on a "blank" line?
What characters were actually printed by the mysteryout program?

1.6.8 (Display page) A powerful idea for CS 61CL

In Java, data of different types are strictly separated. In general, one can use only integer operations on integers, only string operations on strings, and so forth. As we descend through layers of abstraction, however, we lose the notion of types. Data in a low-level computing environment are just collections of bits. The meaning of those bits can't be determined just by looking at them, for the same bits can represent a variety of objects: numbers, characters, even instructions! The identical treatment of instructions and data—the stored program concept—was a big breakthrough in the early history of computing. The first computers literally had "hard-wired" instructions; to change the program that the computer was running, one had to rewire the computer! The treatment of instructions as data, and the implementation of a general interpreter for those instructions, enabled programmers to climb to higher levels of programming abstraction.

1.7 Homework (5 steps)
1.7.1 (Display page) Homework summary

For homework for next lab, finish today's lab activities, do some reading in K&R, write two programs to be submitted online (bitCount and an adding machine), and complete an introductory survey. This may seem like a lot of work to do in two days. We hope, however, that your previous experience with a C-based language will soften the impact of this assignment. (We're also assuming that you don't have much work yet in your other courses.) Directions for online submission of your solutions to the bit-counting and adding machine programs will appear within a day or two.

1.7.2 (Display page) Readings

Read the following in K&R for next lab:

- sections 1.1-1.9;
- all of chapter 2;
- sections 3.1-3.7; and
- sections 4.1 and 4.2.

1.7.3 (Display page) A bitCount function

Write a function named bitCount that returns the number of 1-bits in the binary representation of its unsigned integer argument. Add your function to the following program, which is available online in ~cs61cl/code/bitcount.c, fill in the identification information, and run the completed program.

#include <stdio.h>

int bitCount (unsigned int n);

    int main ( ) {
        printf ("# 1-bits in base 2 rep of %u = %d, should be 0\n", 0, bitCount (0));
        printf ("# 1-bits in base 2 rep of %u = %d, should be 1\n", 1, bitCount (1));
        printf ("# 1-bits in base 2 rep of %u = %d, should be 16\n", 1431655765, bitCount (1431655765));
        printf ("# 1-bits in base 2 rep of %u = %d, should be 1\n", 1073741824, bitCount (1073741824));
        printf ("# 1-bits in base 2 rep of %u = %d, should be 32\n", 4294967295u, bitCount (4294967295u));
        return 0;
    }

    /* Your bit count function goes here. */

1.7.4 (Display page) An adding machine

Write a C program named adder.c that simulates an adding machine. Input to the program will represent integer values, typed one per line without leading or trailing white space. (Don't worry about incorrectly formatted input.) Input should be handled as follows:

- a nonzero value should be added into a subtotal;
- a zero value should cause the subtotal to be printed (in the form "subtotal" followed by a single space followed by the subtotal) and reset to zero;
- two consecutive zeroes should cause the total of all values input to be printed (in the form "total" followed by a single space followed by the total) and the program to be terminated.

Your solution must use the getchar function to read input; it must not use scanf or any of its variations. You are forbidden to use arrays or strings. Don't provide any prompts; the only output your program should produce is lines of the form "subtotal __" and "total __". Add a comment at the start of the program that lists your name and lab section.

Example

    user input    program's output
    5
    6
    0             subtotal 11
    -3
    -2
    5
    0             subtotal 0
    13
    -13
    8
    0             subtotal 8
    0             total 19

This problem is a bit tricky. In particular, you should make sure it works correctly when the first two input values are both 0.

2 Arrays and strings (2009-1-22 ~ 2009-1-23, 4 activities)
2.1 Continuing with wc implementations (3 steps)
2.1.1 (Display page) Character and line counting

In the previous lab, you worked with wc, one of a bunch of handy utilities that consume characters from stdin, do some useful processing, and produce characters for stdout. You can run these from the command shell and type at them, or you can redirect files to them, or you can chain them together in a processing pipeline. Try out the utility wc—remember man wc—by seeing how many lines of code there are in the wc1.c program you debugged in the last lab. wc is most useful on a file, e.g., wc < wc1.c to count the lines, words and characters in the file wc1.c. In the last lab, you debugged code that implemented the word-counting part of wc. Now we'll work on the rest of it. Here's a first step, in ~cs61cl/code/wc2.c.

#include <stdio.h>

    int main ( ) {
        int lines = 0;          /* Initialize line count */
        char c = getchar( );    /* Declare and init to first char */
        while (c != EOF) {
            if (c == '\n')
                lines++;
            c = getchar( );
        }
        printf(" %d\n", lines);
    }

Modify this to print the character count in addition to the line count. If you're tired of a.out, you can specify the name of the executable file with

    gcc wc2.c -o wc

If you have two versions of an executable file with the same name, the search path will determine which takes priority. Type

    echo $PATH

to display the search path, and

    which wc

to display which of the two versions of wc will be run if you just type

    wc wc1.c

To make a long story short, you should type

    ./wc < wc1.c

to run the version in your own directory if your search path would otherwise select the system version to run.

2.1.2 (Display page) And now a word ...

That was easy. Words are harder. As you scan the input you will need to keep some state to tell if you are at the start of a word, still in the middle of one, at the end, or in the middle of some non-word junk. And what is a word anyway? Here's a starting point, which also gives you an answer to the previous exercise. The code is in ~cs61cl/code/wc3.c.

    #include <stdio.h>

    enum bool {FALSE=0, TRUE};

    int startWord (int inword, char c) {
        return TRUE;   // this is a lie
    }

    int endWord (int inword, char c) {
        return FALSE;  // so is this
    }

    int main () {
        int lines = 0;          /* Init the counters */
        int bytes = 0;
        int words = 0;
        int isword = FALSE;
        char c = getchar ( );   /* Prime the loop */
        while (c != EOF) {
            bytes++;
            if (startWord (isword, c)) {        /* Start of a word? */
                words++;
                isword = TRUE;
            } else if (endWord (isword, c)) {   /* End of a word? */
                isword = FALSE;
            }
            if (c == '\n')
                lines++;
            c = getchar ( );
        }
        printf(" %d %d %d\n", lines, words, bytes);
    }

Note that to improve readability we introduced some handy literals with enum. FALSE has to be 0. Your job is to study this and write the predicates startWord and endWord. Remember from last time that you can do comparisons on characters. In fact this is so common that C has some nice libraries to do it, for example, the ctype library you used last lab, described in Appendix B2 of K&R. 2.1.3 (Brainstorm) What's EOF? What's EOF, and how is it distinguished from a character value? Describe how you determined your answer. 2.2 Character arrays and strings(10 steps) 2.2.1 (Display page) C arrays C arrays

An array in C is a contiguous sequence of values of a common type. The simplest of these is an array of characters. Since an ASCII character can be represented using a single 8-bit byte, an array of chars is just a sequence of bytes. A C array is rather more primitive than, say, a Java array object: it does not explicitly record its length, nor does it check that the bounds are respected. It simply holds a sequence of values. (Safe programming takes discipline.) As with arrays in other languages, you access a particular element by specifying an index in square brackets, e.g., dwarf[k]. C arrays are indexed starting at 0, so the first dwarf would be dwarf[0] and the last element of an array of seven dwarves would be dwarf[6]. 2.2.2 (Display page) Out-of-bounds array access Out-of-bounds array access

Experiment with the code below, available in ~cs61cl/code/outOfBounds.c. In particular, supply an assignment to an element of a1 that changes an element of a2.

#include <stdio.h>

int main ( ) {
    int a1[4];
    int k;
    int a2[4];
    for (k = 0; k < 4; k++) {
        a1[k] = k+1;
        a2[k] = 2*(k+1);
    }
    a1[5] = 999;
    a2[5] = 762;
    printf ("k = %d\n", k);
    printf ("elements of a1 = %d %d %d %d\n", a1[0], a1[1], a1[2], a1[3]);
    printf ("elements of a2 = %d %d %d %d\n", a2[0], a2[1], a2[2], a2[3]);
    return 0;
}

2.2.3 (Brainstorm) Results of the experiment Explain why the out-of-bounds array assignments did what they did. Also explain how you found the assignment to a1 that changed an element of a2. 2.2.4 (Display page) Representing character strings Representing character strings

Sequences of characters appear in many places. Text files are sequences of characters. We read sequences of characters from the terminal. In both cases, we often pay special attention to the line breaks. When you type at the terminal, the enter key typically causes a command to execute. It is conveyed to the machine as a newline ('\n') (or sometimes the two-character combination "\r\n", which stands for carriage return, line feed). In some documents the end-of-line is very important; in others it is immaterial. The C language does not take any position on these issues. You can put any sequence of characters you like into a char array. In C, a string is a particular kind of char array that contains a null character ('\0') that signals the end of the string. The null character need not be the last element of the array. Wherever it appears—and that had better be within the bounds of the array—it terminates the string. (Contrast the null character with the newline ('\n'), which has no special meaning. If you happen to print it, it will advance to a new line, but it isn't regarded as "terminating" a string that contains it.) The use of a terminating character, rather than an explicit length, is a design choice in the language. In C, it is extremely simple to pass around references to strings (called pointers, but we'll talk about that later). You can figure out the length by running down the array until you find the terminator. It also means that if you forget to put in the terminator, functions that operate on the string may run off the end and go wandering around all through memory—not a good thing. (Did we mention that safe programming takes discipline?) Oh yes, it is precisely this ability to access memory that makes C so useful for systems code. There is a wonderful set of library routines for strings (described in Appendix B3 in K&R). You will find that you use these all the time.

2.2.5 (Display page) checkPower2 with strings checkPower2 with strings

The following code provides a template for a more useful version of your power-of-two checker.

#include <stdio.h>
#include <stdlib.h>

#define MAXLINE 80

int test (unsigned int n) {
    /* Replace this with an accurate test. This one is too negative. */
    return 0;
}

int prompt ( ) {    /* int, not char, so EOF survives the return */
    printf ("> ");
    return getchar ( );
}

int main ( ) {
    char in[MAXLINE];
    unsigned int n;
    int c;
    int i;
    c = prompt();
    while (c != EOF) {
        i = 0;
        while (c != '\n') {    /* note: no bounds check on i; input is trusted here */
            in[i++] = c;
            c = getchar();
        }
        in[i] = '\0';
        n = atoi (in);
        if (test (n)) printf ("%u is a power of 2\n", n);
        else printf ("%u is not a power of 2\n", n);
        c = prompt();
    }
}
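Incidentally, one common way the test predicate can eventually be completed is the n & (n-1) bit trick. This is just one approach; whatever you worked out in the last lab is fine too.

```c
/* A nonzero n is a power of 2 exactly when it has a single 1 bit,
   i.e. when clearing its lowest 1 bit (n & (n-1)) leaves zero. */
int test (unsigned int n) {
    return n != 0 && (n & (n - 1)) == 0;
}
```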

We have declared in to be a char array of length MAXLINE. (In C, you will want to use #define to provide meaningful symbolic names for constants.) This allocates storage for the array and declares the type of its elements. This is essentially a placeholder for a string. It doesn't have anything in it yet. The function prompt illustrates a string literal. This is a character array of three chars, '>', ' ', and '\0'. It is an array of length three containing a string of length 2. The terminator doesn't count. Note that there is no newline in this string, since the input should appear on the same line as the prompt. This program reads sequences of characters from stdin (the keyboard), creates a string for each line, converts the string to an integer value and tests if the value is a power of 2. It reads the input character by character until it reaches a newline. It uses a very useful function, atoi, which takes a string as input and returns an integer value that the string represents. It will skip over initial whitespace to find a digit. This code is online in the file ~cs61cl/code/newCheckPower2.c. Using what you learned in the last lab, fill in the test function so that it correctly reports a result to main. Also, experiment with the program. Are leading and trailing blanks handled correctly? What about nondigits? (The next step asks about the latter case.) 2.2.6 (Brainstorm) atoi's behavior with nondigits? What does atoi return when given as an argument a string containing a nondigit? Suggest a reason for its behavior. 2.2.7 (Display page) Writing strings as lines Writing strings as lines

The following example shows some useful aspects of strings.

#include <stdio.h>
#include <string.h>

void putline (char str [ ]) {
    int i = 0;
    while (str[i]) putchar (str[i++]);
    putchar ('\n');
}

int main ( ) {
    char msg [ ] = "life is short";
    putline (msg);
    putline ("Eat dessert first");
}

Notice in main that the length of the msg array wasn't provided in the declaration, because the C compiler can infer the length from the initializing string constant. msg is an array of characters holding a string. What size is it? Big enough to hold the string and its terminator. The program also uses a string literal in the last call to putline that is not assigned to any variable. We can pass either of these to putline as a char array. Interestingly, putline doesn't know the length of its argument. It assumes that it is a properly terminated string, so it prints characters until it finds the terminating '\0'. More items of interest:

putline uses a common C idiom that takes advantage of the string terminator, the boolean false, and the integer zero all being the same thing. We could have used while (str[i] != '\0') just as well. The index is post-incremented. That is, the value returned by the expression i++ is the value i has before it is incremented. Some programmers view this usage as asking for trouble with off-by-one errors; you should decide for yourself whether to use the less tricky

while (str[i]) {
    putchar (str[i]);
    i++;
}

The result is that we don't output the terminator. (It happens to be a nonprinting character.) However, we do add a newline.

The code above is in putline.c. Run it (with some modification) to determine the following:

Is the length of the msg string what we claimed above? Use the strlen function declared in string.h to find out.
Are there any nonprinting characters in the output? Use od or your wc program to decide this.
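The terminator-scanning idea that strlen relies on can also be sketched by hand (a toy version for illustration, not a replacement for the library routine):

```c
/* Count characters up to, but not including, the terminating '\0'. */
int myStrlen (char str[]) {
    int i = 0;
    while (str[i] != '\0') i++;
    return i;
}
```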

2.2.8 (Display page) Reading lines as strings; cat Reading lines as strings; cat

Let's write a version of the UNIX utility cat, which copies lines from stdin to stdout. We already have the output side. Here's a skeleton of the rest.

#include <stdio.h>

#define MAXLINE 1024 // that should be big enough

int getline(char str[], int len) {
    /* ... you get to write this part */
}

void putline (char str[]) {
    int i = 0;
    while (str[i]) putchar (str[i++]);
    putchar('\n');
}

int main ( ) {
    char in[MAXLINE];
    while (getline(in, MAXLINE) > 0) putline(in);
}

getline is a lot more cautious than putline. It might be reasonable to insist that all strings that are passed to putline are properly terminated. (They aren't strings otherwise!) But input from the outside world has to be considered suspect until proven otherwise. The len argument is the length of the array that is provided to hold the string. The return value is the length of the string that was placed in this array. If a line happens to be longer than the array's capacity (maybe the whole file is a single line!) it should be broken up or truncated. The code is in ~cs61cl/code/cat.c. Complete the getline function and validate that it works for keyboard input and for files redirected into it. Your test file should include

an empty line
a line with fewer than MAXLINE characters
a line with exactly MAXLINE characters (modify the initialization of MAXLINE to do this test)
a line with more than MAXLINE characters

2.2.9 (Brainstorm) Comparing getlines Compare your version of getline to the one that appears on page 69 of K&R, and defend your preference. In determining which of the two you prefer, consider the following questions: Which one is more elegant? For which is it easier to convince yourself that it works? What do you think about the use of assignment within the loop predicate? 2.2.10 (Brainstorm) Another getline Here's another version of getline. How does it compare with the other two for elegance and clarity? Briefly justify your answer.

int getline(char str[], int len) {
    int i;
    int c;    /* int, so EOF can be distinguished from character values */
    for (i = 0; i < len; i++) str[i] = '\0';  /* Initialize array to string terminator */
    for (i = 0; i < len; i++) {
        c = getchar();
        if (c == '\n') return i+1;  /* EOL; newline replaced with string terminator */
        if (c == EOF) return i;
        str[i] = c;                 /* Add a char to the line */
    }
    str[len-1] = '\0';              /* Line too long; terminate it */
    return len;
}

2.3 Arrays of numbers(4 steps) 2.3.1 (Display page) Getting started with integer arrays Getting started with integer arrays

Any data type that can be defined in C can be extended into an array of things of that type—not just arrays of chars, but arrays of ints, arrays of unsigned ints, arrays of arrays of chars, and so on. We operate on all of these in essentially the same way. They can be initialized, indexed, accessed, or filled in. Strings are the only arrays with a special termination element. Generally, for all other array types the program keeps track of the bounds. The array is a container that holds a contiguous sequence of its elements. So let's take the next step to ints. Here's a little example.

#include <stdio.h>

void printIntArray(int v[], int len) {
    int i;
    printf("[");
    for (i = 0; i < len; i++) printf("%d ", v[i]);
    printf("]\n");
}

int main() {
    int i;
    int v1[] = {10,9,8,7,6,5,4,3,2,1};
    int v2[9];

    for (i = 0; i < 9; i++) v2[i] = v1[i] + v1[i+1];
    printf("v1 = ");
    printIntArray(v1, 10);
    printf("v2 = ");
    printIntArray(v2, 9);
}

The code is in ~cs61cl/code/intarray1.c. Notice that when we declare an array we specify its size. We can initialize it at the point of declaration. If we don't specify a size but do provide an initializer, the size is taken from the initializer. (Yes, we could have used int v1[10]={10,9,8,7,6,5,4,3,2,1};.) We can pass an array to a function. The function can operate on an array of any size, but we need to also pass the length. 2.3.2 (Display page) The sizeof operator The sizeof operator

C contains an operator named sizeof that, given a type or a variable as its argument, returns the number of bytes taken up by the given variable or an element of the given type. For example, sizeof(k) will return the number of bytes used to store the value in the variable k, and sizeof(char) returns the number of bytes used to store anything of type char. sizeof also works with arrays. Here is code (in ~cs61cl/code/intarray2.c) that computes the length argument to printIntArray rather than hard-coding it in the call.

#include <stdio.h>

void printIntArray(int v[], int len) {
    int i;
    printf("[");
    for (i = 0; i < len; i++) printf("%d ", v[i]);
    printf("]\n");
}

int main() {
    int i;
    int v1[] = {10,9,8,7,6,5,4,3,2,1};
    int v2[9];

    for (i = 0; i < 9; i++) v2[i] = v1[i] + v1[i+1];
    printf("sizeof v1 = %lu\n", (unsigned long) sizeof(v1));
    printf("length v1 = %lu\n", (unsigned long) (sizeof(v1)/sizeof(int)));
    printf("v1 = ");
    printIntArray(v1, sizeof(v1)/sizeof(int));
    printf("v2 = ");
    printIntArray(v2, sizeof(v2)/sizeof(int));
}

Notice the difference between the sizeof an array and the length of the array. The array occupies space equal to the number of elements times the size of the elements. 2.3.3 (Display page) Type dependence Type dependence

Notice also that while we can easily write a function to operate on any size of array, that does not mean that it will operate on any type. printIntArray specifies the type of its elements—ints. There are cases where we can write functions that exhibit "type polymorphism", but you need to know what you are doing. Add a function named copyIntArray to intarray2.c that copies array source to array dest, both of length len. Here's the header.

void copyIntArray (int dest[], int source[], int len) { ... }

2.3.4 (Display page) Sorting an array Sorting an array

Add a function named sort to the program. Given an integer array and its length, sort arranges the array elements into ascending order. Use the insertion sort algorithm for this purpose:

for (k = 1; k < len; k++) {
    /* v[0..k-1] is already sorted; insert v[k] into its proper place
       among those elements, shifting larger ones one slot to the right */
}
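If it helps to check your work afterward, the outline above expands to something like the following (one possible transcription; write your own version first):

```c
/* Insertion sort: grow a sorted prefix one element at a time. */
void sort (int v[], int len) {
    int k, j, cur;
    for (k = 1; k < len; k++) {
        cur = v[k];
        /* Shift elements of the sorted prefix that exceed cur one slot right. */
        for (j = k - 1; j >= 0 && v[j] > cur; j--) v[j+1] = v[j];
        v[j+1] = cur;   /* Drop cur into the hole that opened up. */
    }
}
```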

Homework for the weekend is to finish assignment 1 (it's due Monday night at 11:59pm), complete the online survey, and catch up on lab exercises you haven't yet completed. Instructions for submitting the programming assignment will be posted on Friday evening. Homework for next week is a program to process lines; it will be part of the first CS 61CL project. It's described in a subsequent step. 2.4.2 (Display page) Reading Reading

We'll be working with C pointers, files, and structures in next week's labs. Sections 5.1 through 5.10, 6.1 through 6.3, and 7.5 in K&R will provide useful background. 2.4.3 (Display page) Homework for next week Homework for next week

This assignment is due Friday night, January 30, at 11:59pm. It's an individual assignment; do your own work. Submission instructions will appear shortly. Write a program named processLines.c that processes input lines as follows.

1. After reading a line, it removes the character '#' and any characters that follow it from the line. If after that removal, the line consists entirely of white space, it is discarded. 2. A line containing some non-whitespace is assigned a number. Numbers start at zero and increase by one for each nonblank line. 3. If any word on the line ends with a colon (':'), the word is printed, followed by a tab (\t), followed by the current line number, followed by a newline.

Here's an example, with input given on the left and explanation on the right.

    foo                     foo is on line 0
    #                       the result of removing # and subsequent chars leaves a blank line
    a:                      a: appears on line 1
    bbb:                    bbb: appears on line 2

                            two blank lines are ignored

    b c: # hello there      b and c: are on line 3
    d:                      d: is on line 4
    hello world # xyz       hello world is on line 5
    e:                      e: is on line 6

The output is

    a:	1
    bbb:	2
    c:	3
    d:	4
    e:	6

You may assume that lines are at most 80 characters long and words are at most 20 characters long. 3 References in C; introduction to files 2009-1-27 ~ 2009-1-28 (4 activities) 3.2 References and pointers(14 steps) 3.2.1 (Display page) Containers Containers

So far we've worked with C containers. A container is drawn as a box as shown below; it has a name and contents.

int n;
n = 4;

3.2.2 (Brainstorm) The meaning of assignment Suppose the following declarations have been supplied:

int n1 = 3;
int n2 = 14;
int n3 = 14;

What's the meaning of n2 = n1? Does execution of this statement affect the contents of n3? Why or why not? 3.2.3 (Display page) Arrays as containers Arrays as containers

Arrays are containers too. When we say

char str[10] = "go bears";
int values[5] = {3, 1, 4, 5, 9};

we set up two containers named str and values:


Arrays are different in Java. The only way to use an array in Java involves the additional concept of a reference, drawn as an arrow as shown below. (figure: an array in C vs. an array in Java) Arrays in C are different too. Experiment with C arrays to show that they don't act quite like int or char containers.

Arrays in C are different too. Experiment with C arrays to show that they don't act quite like int or char containers. 3.2.4 (Self Test) Things an array can't do Which of the following are illegal uses of an array variable named a? (Assume that b names an array whose size and element type are the same as those of a.)

3.2.5 (Display page) References in C References in C

C has references too. In C, they're called pointers. The type

type1 *

is a pointer to a value of type type1. Thus the declaration

char *str = "go bears";

sets up str as a reference to the string "go bears", as shown below. The pointer itself takes up some memory—pointers on the lab computers take up 4 bytes—and the nine bytes for the string constant are reserved somewhere else. The declarations

char *p;
int *pos;

set up two uninitialized pointers. C's references are more flexible than Java's. If values is a reference to an array, then *values is the first element of the array.

int values[] = {3, 1, 4, 5, 9};
int n;
n = *values;   /* n now contains 3 */

Observe the symmetry between declaration and use. One may read the declaration

int * values;

either as "values is a pointer to an int" or as "the expression *values is an int". 3.2.6 (Brainstorm) Assigning to a pointed-to container What would you expect the statement

*values = 27;

to do? (There are subtleties here that we will discuss in a subsequent lab.) 3.2.7 (Display page) More on pointers More on pointers

A pointer is a container as well as a reference; its contents are an arrow to the referenced object. 0, or NULL, is the null reference. Thus a pointer variable may be uninitialized,

int *values;

it may contain null,

int *values;
values = NULL;

or it may contain an arrow. Unlike Java, C doesn't do any automatic initialization. When declaring two pointers in the same declaration line, you need a * for each. For example,

int *p1, p2;

declares a pointer p1 and an int p2, not two pointers as some might expect. In assignment statements, C automatically converts an array container to the corresponding reference, for example:

int values[10];
int k;
int *p;
for (k = 0; k < 10; k++) values[k] = 2*(9-k);
p = values;
printf ("%d\n", *p);   /* prints 18 */
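A tiny sketch making the conversion concrete: an array name decays to a pointer to its first element, and *(p + k) then names the same element as p[k].

```c
/* Return the element k positions past p, using pointer arithmetic.
   Equivalent to p[k]; the function name is made up for this sketch. */
int kthViaPointer (int *p, int k) {
    return *(p + k);
}
```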

The same automatic conversion takes place for function arguments. The code below, for example, prints different sizeof values:

#include <stdio.h>

void func (int values[5]) {
    printf ("sizeof(values) = %lu\n", sizeof(values));
}

int main ( ) {
    int values[5];
    printf ("sizeof(values) = %lu\n", sizeof(values));
    func (values);
    return 0;
}

3.2.8 (Brainstorm) Something we need pointers for Consider the following code.

int n1, n2;
int temp;
n1 = 3;
n2 = 25;
/* exchange the values in n1 and n2 */
temp = n1;
n1 = n2;
n2 = temp;

We ought to be able to abstract this exchanging operation by coding it in a function, e.g.

void exchange (int a, int b) {
    int temp;
    temp = a;
    a = b;
    b = temp;
}

int main ( ) {
    int n1, n2;
    n1 = 3;
    n2 = 25;
    exchange (n1, n2);
    printf ("n1 = %d, should be 25; n2 = %d, should be 3\n", n1, n2);
}

This doesn't work as intended. Why not?

One may do arithmetic with pointers. For example, if p is a pointer to an element of an int array, p++ increments p to point to the next int in the array. For an array named, say, values, the two expressions values[k] and *(values+k) are equivalent. 3.2.10 (Brainstorm) Nonsense pointer arithmetic? Assuming that p1 and p2 represent legal pointer values and n is a small integer, which of the following expressions are likely to be nonsense? Briefly explain your answer.

1. p1 + n
2. n + p1
3. p1 + p2
4. p1 - n
5. n - p1
6. p1 - p2
7. p1 > p2
8. p1 > n
9. p1 == 0

3.2.11 (Display page) Pointers are really general! Pointers are really general

It's easy to exchange items for which we have references. For example, the code below exchanges the first characters in the two strings:

void exchange1stChars (char *s1, char *s2) {
    char temp;
    temp = *s1;
    *s1 = *s2;
    *s2 = temp;
}

In C, every variable has an address that acts like a reference to that variable. It's accessed via the & operator. The following code successfully implements exchange of two integers.

#include <stdio.h>

void exchange (int* a, int* b);

int main ( ) {
    int k1 = 10;
    int k2 = 20;
    exchange (&k1, &k2);
    printf ("k1 was 10, is now %d; k2 was 20, is now %d.\n", k1, k2);
    return 0;
}

void exchange (int *a, int *b) {
    int temp = *a;
    *a = *b;
    *b = temp;
}

3.2.12 (Brainstorm) A variation on exchanging Consider the following program, which differs from the previously seen exchange code only in the declaration and use of temp.

#include <stdio.h>

void exchange (int *px, int *py) {
    int *temp;
    *temp = *px;
    *px = *py;
    *py = *temp;
}

int main ( ) {
    int a = 3, b = 7;
    exchange (&a, &b);
    printf ("a's new value is %d; b's new value is %d\n", a, b);
    return 0;
}

What's the effect of running this program? Choose from the list below, and justify your answer.

a. The program doesn't compile.
b. The program compiles, but when run, it might crash.
c. The program compiles. When run, it prints

a's new value is 3; b's new value is 7 d. The program compiles. When run, it prints

a's new value is 3; b's new value is 3 e. The program compiles. When run, it prints

a's new value is 7; b's new value is 3 f. The program compiles. When run, it prints

a's new value is 7; b's new value is 7

3.2.13 (Display page) scanf scanf

The scanf function illustrates the use of references to scalar variables. scanf is the main way to do formatted input in C. It is superficially similar to printf, in that its first argument is a formatting string that indicates the format of the input. This string contains %d, %c, etc. The remaining arguments are pointers to variables that scanf will fill with input values. (Why are pointers needed for the variables to be filled?) If these variables are scalar—int and char—ampersands must be used with them. 3.2.14 (Display page) A scanning function A scanning function

Write a function named nextWord that returns the next word in a string passed as an argument. The prototype of nextWord is

int nextWord (char *str, int pos, char *word);

Its parameters are as follows:

1. a string str from which you would like the next word; 2. an int named pos that says where in the string the search for the next word should begin; 3. another string named word into which the next word should be copied.

The nextWord function should return the position of the whitespace character or null character that indicates the end of the word being returned, or -1 if there are no more words left in the string. You may assume that str contains no more than 100 characters, and that words are no more than 20 characters long. You may also assume that pos is between 0 and 1 plus the length of str, inclusive. Here is a test program, also available in ~cs61cl/code/nextword.c:

#include <stdio.h>
#include <string.h>

#define MAXLINELEN 101
#define MAXWORDLEN 21

int nextWord (char *, int, char *);

int main ( ) {
    char line[MAXLINELEN];
    char word[MAXWORDLEN];
    int pos;

    while (getline (line, MAXLINELEN-1)) {
        pos = 0;
        while ((pos = nextWord (line, pos, word)) != -1) {
            printf ("%s\n", word);
        }
    }
    return 0;
}

You wrote the getline function for an exercise in the previous lab. The printf function, given a string as argument to correspond with %s in the format string, prints the characters in the string (not including the terminating null). Suppose the line contains " a abc ". Then the inner loop would execute three times, with return values indicated in the table below.

    call               contents of word   return value
    first (pos == 0)   "a"                4
    second             "abc"              9
    third              undefined          -1

3.3 Files(6 steps) 3.3.1 (Display page) Input and output streams we've encountered already Input and output streams we've encountered already

In the previous lab, you worked with code that read from stdin (standard input) and wrote to stdout (standard output). These are variables of type FILE * that are set up by the operating system to particular input and output streams of characters. For example, stdin is by default associated with the input stream coming from the workstation keyboard. However, the UNIX shell allows input to be redirected to come from a file or from the output of another program; in those cases, stdin will be initialized to one of these alternative input streams. Note the use of pointers as an information hiding mechanism. Instead of revealing all the details of the FILE type, the C library routines allow client code only to deal with a pointer to a FILE. The internal variables of a FILE are inaccessible to the user. In the next few activities, we will work with code that accesses files directly, using variables not already initialized by the system. 3.3.2 (Display page) File operations File operations

File variables are of type

FILE *

The FILE type is defined in stdio.h, the file that's #include'd by almost all C programs. stdio.h also contains declarations for the preinitialized FILE * variables:

FILE *stdin;    /* standard input */
FILE *stdout;   /* standard output */
FILE *stderr;   /* "standard error", for diagnostic output */

The first task is to associate a FILE * variable with a file in the external directory structure. This is done with the fopen function. fopen takes two arguments: a character string—the name of the file in the external directory structure—and a string that indicates how the file will be accessed. Allowable values for this second argument include "r", indicating that the file will be read, and "w", indicating that the file will be written. (In the latter case, the file will be overwritten if it already exists.) fopen returns either a reference to the external file, or a null reference if for some reason the file can't be accessed. (Common reasons are that a file being opened for input doesn't exist, that an output file exists but is protected against overwriting, or that the directory is protected against access or modification.) A useful function for character input is fgets. The first argument to fgets—we'll call it s—is the character array into which characters will be read. Its second argument is an int n that limits how many characters will be read into s—no more than n-1. This allows the programmer to guard against overflow of the space allocated for s. The third argument is the FILE * variable from which the characters will be read. fgets reads characters into s until either it sees a newline character '\n' or it has read the (n-1)st character. The newline, if encountered, is stored into s. In either case, a null character '\0' is also added to s. fgets returns s, or NULL—i.e. zero—if end of file or an error occurs.

3.3.3 (Brainstorm) Calls to fgets Consider the following program segment.

char s[3];
fgets (s, 3, stdin);
fgets (s, 3, stdin);
fgets (s, 3, stdin);
fgets (s, 3, stdin);
fgets (s, 3, stdin);

Give the contents of s after each call to fgets, assuming that the next eight characters in stdin are

'A' 'B' 'C' '\n' '\n' '\0' 'D' '\n' 3.3.4 (Display page) File output operations File output operations

We've seen the printf function. Its file counterpart is fprintf. The first argument to fprintf is a FILE *, representing the output stream. Subsequent arguments are exactly like those for printf. In fact, the following are equivalent:

printf ( ... );
fprintf (stdout, ... );
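Putting fopen, fgets, fprintf, and fclose together, a minimal file-copy sketch might look like the following. The function name and buffer size are made up for illustration, and the sketch assumes each line fits in the buffer.

```c
#include <stdio.h>

/* Copy a text file line by line.
   Returns 0 on success, -1 if either file can't be opened. */
int copyFile (char *fromName, char *toName) {
    char buf[256];
    FILE *from = fopen (fromName, "r");
    FILE *to;
    if (from == NULL) return -1;
    to = fopen (toName, "w");
    if (to == NULL) { fclose (from); return -1; }
    while (fgets (buf, 256, from) != NULL) {
        fprintf (to, "%s", buf);
    }
    fclose (from);
    fclose (to);    /* flushes buffered output to the file */
    return 0;
}
```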

The fclose function is the inverse of fopen; it removes the association between its argument and the corresponding input or output stream. This may be used with input streams as well as output streams. One aspect of output handling that requires fclose, however, is that output characters are stored in a separate array called a buffer before being copied to a file, and fclose makes sure that all of the buffer is copied to the file—"flushed" is the technical term—before the program ends. We'll cover the details of input and output later this semester. 3.3.5 (Display page) File names on the command line File names on the command line

You may have already encountered the argv array (the name means "argument vector"), which gives you access to the arguments given on the command line when the program was started. argv and argc—the number of arguments on the command line—are supplied as parameters to main:

int main (int argc, char *argv[ ]) { ... }

argv is thus an array of strings. The string argv[0] is the name by which the program was invoked. (This usage differs from that of Java.) Since this is always provided, argc is always at least 1. The first argument is then argv[1], the second is argv[2], and so on. The C standard requires that argv[argc] be a null pointer, providing a bit of redundancy. Section 5.10 in K&R supplies more detail about accessing and using command-line arguments. Write a program that prints one of the following, depending on the contents of the command line.

too many arguments
too few arguments
argument file does not exist or is inaccessible
argument file is empty
argument file is nonempty

The last three possibilities occur when the command line contains exactly one word after the command itself. 3.3.6 (Display page) Rewrite of wc Rewrite of wc

Last lab you wrote wc, using stdin and stdout. Now rewrite it with input taken from a file named on the command line. Print a message to stderr if no files or too many files are named on the command line, or if the file that is named can't be opened for some reason. Test your program by comparing its output with your version from last lab. 3.4 Homework and readings(1 step) 3.4.1 (Display page) Things to do Things to do

A reminder: the line processing homework assignment described in the previous lab is due prior to lab on Thursday or Friday. We'll be covering linked lists next lab. Sections 6.4 through 6.7 in K&R will be useful background reading. 4 Structures, abstract data types, and dynamic storage 2009-1-29 ~ 2009-1-30 (4 activities) 4.1 Structures(7 steps) 4.1.1 (Display page) Examples of structures Examples of structures

A struct—short for "structure"—provides a way to group data of different types. It's basically a Java object without methods or inheritance. Here are several examples that involve structs.

A point on a display is basically a (row, column) coordinate pair. Grouping the coordinates in a struct is thus quite natural:

struct point { int row; int column; }; A key-value pair is another common use of a struct. We might represent the decimal equivalent of a Roman digit ('I' means 1, 'V' means 5, etc.):

struct romanTranslation { char romanDigit; int decimalEquiv; }; Code to manipulate a student database might involve a more complicated struct:

struct student { char * name; char * studentId; int unitCount; ... };
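To see the key-value pairing in action, here is a sketch of a lookup over a small table of struct romanTranslation values (the table contents and the romanValue function are my own illustration):

```c
#include <stdio.h>

struct romanTranslation {
    char romanDigit;
    int decimalEquiv;
};

/* A small fixed table of (key, value) pairs. */
static struct romanTranslation romans[] = {
    {'I', 1}, {'V', 5}, {'X', 10}, {'L', 50},
    {'C', 100}, {'D', 500}, {'M', 1000}
};

/* Linear search on the key field; -1 signals "not found". */
int romanValue (char c) {
    int k;
    int n = sizeof(romans) / sizeof(romans[0]);
    for (k = 0; k < n; k++) {
        if (romans[k].romanDigit == c) {
            return romans[k].decimalEquiv;
        }
    }
    return -1;
}
```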

4.1.2 (Display page) Declaring structures Declaring structures

Finally, we might pair an array of values, representing a collection, with the number of elements currently in the collection. In both examples below, the variable values represents a collection of at most 20 integers.


#define MAXSIZE 20

struct collection {
    int myValues[MAXSIZE];
    int mySize;
    ...
} values;

struct collection {
    int myValues[MAXSIZE];
    int mySize;
}; /* don't forget the semicolon */
struct collection values;

The phrase "struct ___" is a type in C, as seen in the second example above.

4.1.3 (Display page) struct operations

The variables grouped by a struct are called "members" or "fields". A field is accessed with the dot operator, just as in Java.

struct collection values; ... values.myValues[0] = 314; values.mySize = 1;

A struct variable may be initialized using curly brace notation:

struct collection emptyCollection = {{0}, 0}; It may also be initialized via assignment, passed as a function argument, or returned as the value of a function.

struct collection values, emptyCollection; ... values = emptyCollection;
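A short demonstration that assignment copies every field, using the point struct from earlier in lab (the samePoint and makePoint helpers are mine):

```c
#include <stdio.h>

struct point {
    int row;
    int column;
};

/* Structs have no == operator; compare field by field instead. */
int samePoint (struct point a, struct point b) {
    return a.row == b.row && a.column == b.column;
}

/* Build a point value; convenient for experiments. */
struct point makePoint (int row, int column) {
    struct point p;
    p.row = row;
    p.column = column;
    return p;
}
```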

Unfortunately, two structs may not be compared with ==. Equality comparison must be done field by field. 4.1.4 (Brainstorm) Arrays vs. structs How are structs similar to arrays? How are they different? 4.1.5 (Brainstorm) Decoding an error message The program below, when compiled, produces the error message

error: two or more data types in declaration of 'p' What is the error, and which two data types is it complaining about? The program:

int main ( ) {
    struct point {
        int x;
        int y;
    }
    struct point p;
    p.x = 7;
    p.y = 8;
    return 0;
}

4.1.6 (Display page) Arrays of structs

The code below, also in ~cs61cl/code/table.c, is a framework for a program to read and store pairs of information (from a file named as a command-line argument), then to print the pair values just read. You are to fill in the missing code. Here is some sample input.

pencil 5
pen 14
eraser 2

The program globally declares a struct table that represents a collection of pairs. As in earlier examples, we include as fields of the table an array of pairs and the number of pairs in the array. Each pair consists of a char array and an associated numeric value. This exercise assumes you've completed the activities involving files and command-line arguments, and have also written the nextWord function that successively returns words from a line. (The atoi function will be useful in converting the second word on each line to an int. Alternatively, you may use scanf to read values from a line.) You may assume that the maximum number of lines in the file, the length of each line, and the length of each word are as given in the #define directives at the start of the program.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#define MAXNUMLINES 10 /* max number of lines in a file */
#define MAXLINELEN 100 /* max line length */
#define MAXNAMELEN 20 /* max length of word on a line */

struct pair { char name[MAXNAMELEN]; int value; };

struct table { struct pair elements[MAXNUMLINES]; int elementCount; } pairs;

void loadTable (char *fileName);

void printTable ( ); int nextWord (char line [ ], int pos, char word [ ]);

int main (int argc, char *argv[ ]) { loadTable (argv[1]); printTable ( ); return 0; } 4.1.7 (Display page) Pointers to structs Pointers to structs

We have seen structs that contain pointers and arrays that contain structs, but we haven't quite seen pointers to structs. A pointer to a struct is essentially what we think of as an object reference in a language like Java. It is so important in C that there is special syntax for it. Here is some review, followed by an explanation of pointers to structs. K&R provide more detail in chapter 6. Let's take the point structure from earlier in lab.

struct point { int row; int column; }; We can statically allocate one with

struct point origin; We can statically allocate one and initialize it with

struct point origin = {0,0}; We can refer to one of its coordinates as

origin.row We can declare a pointer to one of these with

struct point *p;

We could even assign p to point to origin with

p = &origin; And we could reference one of its coordinates using either the combination

(*p).row or a special arrow syntax:

p->row 4.2 Abstract data types in C(5 steps) 4.2.1 (Display page) Requirements for an abstract data type Requirements for an abstract data type

Wikipedia defines an abstract data type as follows:

An abstract data type (ADT) is a specification of a set of data and the set of operations that can be performed on the data. Such a data type is abstract in the sense that it is independent of various concrete implementations.

Abstract data types are important to programmers because of how programs typically evolve: a modification to a program commonly requires a change in one or more of its data structures. For instance, a new field might be added to a personnel record to keep track of more information about each individual; an array might be replaced by a linked structure to improve the program’s efficiency; or a bit field might be changed in the process of moving the program to another computer. A programmer doesn’t want such a change to require rewriting every procedure that uses the changed structure. Thus, it is useful to separate the use of a data structure from the details of its implementation, and to hide the latter from the user of the type. The convention for doing this separation in C is to use two files for a data type:

One, whose name ends in ".h", provides prototypes for functions that represent operations on the type. This provides the abstract view of the type. The second, whose name ends in ".c", provides the code—function bodies and variable declarations—that implements those operations.

4.2.2 (Display page) Relevance of pointers to structs Relevance of pointers to structs

The user of a data type will generally want to work with instances of the type. One thus needs a general way to name the type while preserving the ability to replace one implementation with another. Object-oriented languages handle this problem with inheritance. In C, we represent each such type as a pointer to a struct. Moreover, we take advantage of a feature of C that allows a pointer to a struct to be declared without having to know the fields of the struct. Here's an example of the ".h" and ".c" files for a stack type.

stack.h

struct stack; /* allows use of pointer to represent a stack */

/* Return a pointer to an empty stack. */

extern struct stack * init ( );

/* Push value onto the stack, returning success flag. */ extern int push (int, struct stack *);

/* Pop value from the stack, returning success flag. */ extern int pop (struct stack *);

/* Print the elements of the stack. */ extern void print (struct stack *);

stack.c

#include "stack.h"
#define STACKSIZE 5

struct stack {                  /* stack is implemented as */
    int stackItems [STACKSIZE]; /* an array of items */
    int nItems;                 /* plus how many there are */
} myStack;
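For concreteness, here is one possible bounds-checked push and pop over that representation (a self-contained sketch of mine, not necessarily the course's official solution; it uses int success flags rather than the boolean in stack.h):

```c
#define STACKSIZE 5

struct stack {
    int stackItems [STACKSIZE];
    int nItems;
} myStack;

/* Push value; return 1 on success, 0 if the stack is full. */
int push (int value, struct stack *s) {
    if (s->nItems == STACKSIZE) return 0;
    s->stackItems[s->nItems++] = value;
    return 1;
}

/* Pop the top value; return 1 on success, 0 if the stack is empty. */
int pop (struct stack *s) {
    if (s->nItems == 0) return 0;
    s->nItems--;
    return 1;
}
```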

/*
** Return a pointer to an empty stack.
*/
struct stack * init ( ) {
    myStack.nItems = 0;
    return (&myStack);
}
...

4.2.3 (Brainstorm) A flaw in the implementation What's the result of the following sequence of statements?

struct stack * stack1 = init ( ); push (23, stack1); struct stack * stack2 = init ( ); (We'll see how to deal with this problem shortly.) 4.2.4 (Display page) Implementation of a queue Implementation of a queue

The files queue.h and queue.c in ~cs61cl/code define operations for a queue (a first-in, first-out structure). The code for an inefficient implementation is given below. Supply correct bodies for the missing functions, then test your implementation using the queuemain.c program.

queue.h

struct queue;

/* return an empty queue */ struct queue *init ( );

/* return true if the queue is empty */ int isEmpty (struct queue *q);

/* add an element to the end of the queue */ struct queue * enqueue (int val, struct queue *q);

/* remove the front of the queue; do nothing if the queue is empty */ struct queue * dequeue (struct queue *q);

/* return the front of the queue; undefined result if the queue is empty */ int front (struct queue *q);

/* print the queue */ void print (struct queue *q);

queue.c

#include <stdio.h>
#include "queue.h"

#define FALSE 0
#define TRUE !FALSE
#define MAXQUEUESIZE 4

struct queue { int elems[MAXQUEUESIZE]; } myQueue;

/* return an empty queue */
struct queue *init ( ) {
    int k;
    for (k = 0; k < MAXQUEUESIZE; k++) {
        myQueue.elems[k] = 0;
    }
    return &myQueue;
}

/* return true if the queue is empty */ int isEmpty (struct queue *q) { return q->elems[0] == 0; }

/* add an element to the end of the queue */ struct queue * enqueue (int val, struct queue *q) { /* You fill this in. */

return q; }

/* remove the front of the queue; do nothing if the queue is empty */ struct queue * dequeue (struct queue *q) { /* You fill this in. */ return q; }

/* return the front of the queue; undefined result if the queue is empty */ int front (struct queue *q) { return q->elems[0]; }

/* print the queue */
void print (struct queue *q) {
    int k;
    for (k = 0; k < MAXQUEUESIZE; k++) {
        if (q->elems[k] == 0) {
            printf ("\n");
            return;
        } else {
            printf ("%d ", q->elems[k]);
        }
    }
    printf ("\n");
}

queuemain.c

#include <stdio.h>
#include "queue.h"

int main ( ) {
    struct queue *q;
    q = init ( );
    print (q);
    printf ("is a new queue empty? %s\n", isEmpty(q) ? "yes" : "no");
    q = dequeue (q);
    print (q);
    printf ("is an almost-new queue empty? %s\n", isEmpty(q) ? "yes" : "no");
    q = enqueue (3, q);
    print (q);
    printf ("is a queue that contains 3 empty? %s\n", isEmpty(q) ? "yes" : "no");
    q = enqueue (1, q);
    print (q);
    printf ("is a queue that contains 3 1 empty? %s\n", isEmpty(q) ? "yes" : "no");
    q = enqueue (4, q);
    print (q);
    printf ("is a queue that contains 3 1 4 empty? %s\n", isEmpty(q) ? "yes" : "no");
    q = enqueue (5, q);
    print (q);
    printf ("is a queue that contains 3 1 4 5 empty? %s\n", isEmpty(q) ? "yes" : "no");
    q = dequeue (q);
    print (q);
    printf ("is a queue that contains 1 4 5 empty? %s\n", isEmpty(q) ? "yes" : "no");
    q = dequeue (q);
    print (q);
    printf ("is a queue that contains 4 5 empty? %s\n", isEmpty(q) ? "yes" : "no");
    q = dequeue (q);
    print (q);
    printf ("is a queue that contains 5 empty? %s\n", isEmpty(q) ? "yes" : "no");
    q = dequeue (q);
    print (q);
    printf ("is an empty queue empty? %s\n", isEmpty(q) ? "yes" : "no");
    return 0;
}

4.2.5 (Display page) Another implementation

Now substitute ~cs61cl/code/circqueue.c for queue.c. The circqueue.c code implements a circular queue, a more efficient implementation that moves elements around much less. (Wikipedia has a nice article on circular queues.) The code appears below. Again, fill in the missing function bodies and test the implementation with queuemain.c.

#include <stdio.h>
#include "queue.h"

#define FALSE 0
#define TRUE !FALSE
#define MAXQUEUESIZE 5 /* need one more element than in queue.c */

struct queue {
    int elems[MAXQUEUESIZE];
    int first; /* index of the head of the queue */
    int last;  /* index of the empty space following the last item in the queue */
} myQueue;
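The arithmetic that makes the buffer circular is just a modulus; a sketch (nextIndex is my name; the enqueue and dequeue bodies remain yours to write):

```c
#define MAXQUEUESIZE 5

/* Step an index forward one slot, wrapping from the end
   of the array back to slot 0. */
int nextIndex (int i) {
    return (i + 1) % MAXQUEUESIZE;
}
```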

/* return an empty queue */
struct queue *init ( ) {
    myQueue.first = myQueue.last = 0;
    return &myQueue;
}

/* return true if the queue is empty */
int isEmpty (struct queue *q) {
    return q->first == q->last;

}

/* add an element to the end of the queue */ struct queue * enqueue (int val, struct queue *q) { /* You fill this in. */ return q; }

/* remove the front of the queue; do nothing if the queue is empty */ struct queue * dequeue (struct queue *q) { /* You fill this in. */ return q; }

/* return the front of the queue; undefined result if the queue is empty */ int front (struct queue *q) { return q->elems[q->first]; }

/* print the queue */ void print (struct queue *q) { /* You fill this in. */ } 4.3 Dynamic memory allocation(8 steps) 4.3.1 (Display page) Introduction to malloc Introduction to malloc

One might view C as the "manual transmission" of programming languages, since it provides more control over a computation. An example is dynamic memory management. So far we have seen two kinds of storage:

Static storage exists from the time the program starts until it exits. This includes variables that are declared outside of any procedure. Such variables are called external variables in C. The code for functions is also statically allocated (although the functions are called dynamically).

Local storage exists for the duration of a function call. This includes the arguments passed to the function and the variables declared within the function. Such variables are called automatic in C.

You will shortly learn how to allocate storage dynamically for objects that are created during the execution of the program, but whose lifetime is not necessarily contained within that of a particular function call. This kind of storage is implicit in object-oriented languages like Java and C++. For example, in Java, if you have a class Planet, you create a new object of that class with

Planet terra = new Planet();

This automatically allocates storage to hold all the fields of the new Planet object and associates a set of methods with that object. The storage for the object is implicitly reclaimed when there are no longer any references to the object. Automatic storage management is integrated deeply into the language. In C, we allocate storage manually by calling malloc. It is a general-purpose storage allocator. We call it for any object. It doesn't even know what kind of object we are talking about. It just allocates a contiguous chunk of bytes. We give malloc the number of bytes we want, and it returns a pointer to space for an object of the requested size, or NULL if the request cannot be satisfied. Its signature is

void *malloc (unsigned nbytes);

When we use malloc, we have to tell it what kind of object we are talking about. We do this with a cast, a request to reinterpret the void* pointer that malloc returns as a pointer to some other type. For example,

char *p = (char *) malloc(17);

causes p to point to a newly allocated chunk of empty space that is at least 17 bytes in size. If you allocated such space, you could discard it manually with free(p). Be sure that it is no longer needed before you free it; otherwise very confusing and usually bad things are likely to happen. Both of these functions are declared in stdlib.h, along with fancier tools. They are standard in C implementations, but much less deeply integrated into the language than, say, new in Java. 4.3.2 (Self Test) Kinds of storage What kind of storage is used for the variable k in the code below?

#include <stdlib.h>

int n = 7;

int main() {
    int *k;
    n = 5;
    k = (int *) malloc (sizeof (int));
    *k = 3;
    return 0;
}

What kind of storage is used for the variable n in the code just given?

What kind of storage is used for the variable *k in the code just given?

4.3.3 (Display page) Dynamic string creation

Here's a simple example, the C equivalent of dynamic object creation.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char *new_string (char *s) {
    int len = strlen (s);
    char *new_s = malloc (len+1);
    strcpy (new_s, s);
    return new_s;
}

It takes a string as input and creates a new object containing this string. In a sense, it duplicates the string. The code uses strcpy, a handy routine from the string library, rather than writing a loop to do the copy. Look it up in K&R. Here's a little driver function you can use to try it out. Test it with and without a command-line argument. The next step will ask you about the output.

int main (int argc, char *argv[]) {
    char *s = argc > 1 ? argv[1] : "this is a test";
    char *t = new_string (s);
    printf ("s @%08x: %s\n", s, s);
    printf ("t @%08x: %s\n", t, t);
    return 0;
}

Some features in the code may be unfamiliar; look them up in K&R. 4.3.4 (Brainstorm) Memory layout What does the output of the program given in the previous step tell you about how the different kinds of storage are laid out in memory? 4.3.5 (Display page) A dynamic array

Complete the function range in the following example. It should return a new array containing all the integer values between low and high, inclusive. (Hint: remember that sizeof(int) returns the size in bytes of an object of type int. 10 * sizeof (int) is equal to sizeof (A) if A were declared as int A[10].)
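The hint translated to code, for the allocation step only (makeIntArray is my name; filling the array is the exercise):

```c
#include <stdlib.h>

/* Allocate space for n ints; the caller owns (and should free) it.
   Returns NULL if the request cannot be satisfied. */
int *makeIntArray (int n) {
    return (int *) malloc (n * sizeof (int));
}
```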

#include <stdio.h>
#include <stdlib.h>

int *range(int low, int high) { ... you write this part... }

int main(int argc, char *argv[]) {
    int i;
    int l = (argc > 1) ? atoi(argv[1]) : 0;
    int h = (argc > 2) ? atoi(argv[2]) : 0;
    int *r = range(l,h);
    for (i = 0; i <= h - l; i++) {
        printf ("%d ", r[i]);
    }
    printf ("\n");
    return 0;
}

For example, ./range -3 5 prints

-3 -2 -1 0 1 2 3 4 5 4.3.6 (Display page) More dynamic string objects More dynamic string objects

Here is a more complete example with some patterns that you should study and may want to reuse. It is our familiar friend that copies an input file to an output file line by line. But getLine has been generalized to newLine. Instead of passing it an array to fill with characters, newLine creates a new string object containing the contents of the line it gets.

#include <stdio.h>
#include <stdlib.h>

/* Read a line of text into line array. Clean up potential missing lf in final line. */ #define MAXLINELEN 128

char *newLine (FILE *infile) {
    int i, ch;
    char *s = NULL;
    char line [MAXLINELEN]; /* Temporary array */
    int len = 0;
    if (feof (infile)) return NULL;
    /* Read the line into a local array up to a newline. */
    ch = fgetc(infile);
    while ((ch != EOF) && (ch != '\n')) {
        if (len < sizeof(line)-1) line[len++] = ch;
        ch = fgetc(infile);
    }
    /* Allocate a string to hold it, copy and return. */
    s = malloc(len+1);
    for (i = 0; i < len; i++) s[i] = line[i];
    s[i] = '\0';

return s; }

int main(int argc, char *argv[]) {
    char *s;
    FILE *infile;
    FILE *ofile;

    infile = stdin;
    ofile = stdout;
    if (argc >= 3) ofile = fopen(argv[2],"w");
    if (argc >= 2) infile = fopen(argv[1],"r");

    s = newLine(infile);
    while (s) {
        fprintf(ofile,"%s\n",s);
        s = newLine(infile);
    }
    fclose(ofile);
    return 0;
}

4.3.7 (Self Test) Questions about string objects In the previous example, when can the string object s be freed?

Why is the line array necessary?

When is the storage for the temporary buffer line reclaimed?

4.3.8 (Display page) Creating data structures on the fly Creating data structures on the fly

Modify the previous example so that, instead of printing the strings as they are encountered, you build up an array of strings that represents the entire input file. Declare an external array to hold all the strings, e.g.,

#define MAXLINES 100000
char *lines[MAXLINES]; /* Array of strings */

Modify main so that it stores the strings into the lines array while reading them in and then prints them all when no more lines remain. 4.4 Homework and reading(2 steps) 4.4.1 (Display page) Homework summary Homework summary

Homework assignment 2 is due this coming Friday (just before midnight), as well as your peer evaluations of homework 1. Homework assignment 3, due Monday February 9, is described in the next step. It's getting to the point where you will find the C library functions to be very helpful. At some point soon, you should review the descriptions of the following libraries in K&R Appendix B:

stdio.h
ctype.h
string.h
stdlib.h
limits.h

4.4.2 (Display page) Homework assignment 3 Homework assignment 3

Submit a solution to this assignment to Expertiza by Monday, February 9 at 11:59pm. Details for doing so will be provided soon. This is an individual assignment; do your own work. Write a program that reads lines from a file named in the first command-line argument, and keeps track of symbol definitions and symbol uses. A symbol definition is a word of at least two characters that ends with a colon; the defined symbol is what results from removing the colon. For example, "abc:" is a definition of the symbol named "abc". A symbol use is just an occurrence of the symbol on a line of text. The output of your program should be written to a file named in the second command-line argument. There should be one line for each definition in the input file, containing the following:

1. the name of the defined symbol (without the colon)
2. a tab character
3. the number of the line on which the symbol was defined
4. a tab character
5. a (possibly empty) "use list" of line numbers containing uses of the symbol, with consecutive line numbers separated by tab characters

Line numbers start at 1. Strip sequences of characters starting with "#" from each line as in homework 2. Ignore—i.e. do not number—blank lines, or lines that become blank as a result of # removal. All lines that contain uses of a given symbol will appear in the symbol's use list. The use list should be sorted in increasing order. A symbol used multiple times on a line should contribute one entry in the use list for each use. A line may contain multiple definitions. The lines should be sorted alphabetically by symbol name. You may assume that each line in the file is no more than 100 characters long, and that each symbol is no more than 20 characters long. You may also assume that a colon will not appear by itself in a word; i.e. don't worry about the cases

start_of_line : end_of_line

white_space : end_of_line start_of_line : white_space white_space : white_space

You may not assume any limit on the number of lines in the file. Assume that each symbol will be defined exactly once, but uses of the symbol may precede its definition. Here's an example. input file output file # blank line abc 3 1 abc mike 4 y: x 3 4 y 2 3 3 5 6 abc: x: y y mike: x y y You should implement the symbol table as an abstract data type, according to the discussion in lab activities of January 29-30. In particular, isolate the implementation details of table manipulation in a file table.c. Here is the corresponding table.h; your table.c should implement the functions whose prototypes appear below.

#include <stdio.h>

struct table;

/* returns an empty table */ struct table * initTable ( );

/* returns the result of adding a definition for the given symbol in the line with the given number to the table */ struct table * addDef (char * symbol, int lineNum, struct table * symbols);

/* returns the result of adding a use of the given symbol in the line with the given number to the table */ struct table * addUse (char * symbol, int lineNum, struct table * symbols);

/* print the table, in the format described above, to the given file */ void printTable (struct table * symbols, FILE * out);

You should also supply a file named tabletest.c that provides a main program, input handling code, etc. You may use any other code you've written this semester, code from K&R, and C library routines in your solution. 5 Memory layout, more about pointers, linked lists 2009-2-03 ~ 2009-2-04 (5 activities) 5.2 Memory layout(13 steps) 5.2.1 (Display page) Peeking beneath the memory abstraction layer Peeking beneath the memory abstraction layer

One of the important layers of abstraction that high-level languages provide is that memory is just a collection of named variables and references to those variables (either pointers or uses of &). In this set of activities, we will peek beneath this abstraction layer, using the & operator in combination with gdb to investigate exactly where in memory various objects are stored. What is memory anyway? It may be regarded as a large array of bytes, as shown below. The address of an object is the position in that array of the object. An object's size (accessible using the sizeof operator) is the number of

bytes it takes up in memory. Memory is also organized into words, usually four-byte chunks. Values of type int and pointers are stored in words. Moreover, they are aligned: a word occupies bytes at addresses a0, a1, a2, a3, where a0 is divisible by 4, a1 = a0+1, a2 = a1+1, and a3 = a2+1, as


shown below. We will now go on to explore how various high-level language constructs—local variables and function parameters, global variables, storage allocated by new in Java or C++ and by malloc in C, and even functions—are stored in memory. 5.2.2 (Display page) The four kinds of memory Four kinds of memory

Memory in an executing program is split into four areas:

Program "text": the machine instructions of the program. This doesn't change while the program is running.

The stack: used for variables declared inside functions, parameters, and function return addresses. A function's stack segment (and the data it contains) goes away when the function returns.

The heap: for storage allocated with malloc. This data lives until deallocated by the programmer.

Static memory: for variables declared outside functions. This data is basically permanent for the entire program run.
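A small probe, in the spirit of the gdb exercises below, that checks the four areas really are distinct (a sketch of mine; exact addresses vary by platform, and the function-pointer cast is a common but technically non-portable trick):

```c
#include <stdlib.h>

int aGlobal;   /* static memory */

/* Return 1 if a text, stack, heap, and static address are
   pairwise distinct. */
int areasDistinct (void) {
    int aLocal;                                    /* stack */
    int *heapInt = (int *) malloc (sizeof (int));  /* heap */
    void *addrs[4];
    int i, j;
    addrs[0] = (void *) areasDistinct;             /* program text */
    addrs[1] = (void *) &aLocal;
    addrs[2] = (void *) heapInt;
    addrs[3] = (void *) &aGlobal;
    for (i = 0; i < 4; i++)
        for (j = i + 1; j < 4; j++)
            if (addrs[i] == addrs[j]) return 0;
    free (heapInt);
    return 1;
}
```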

Shown below are two diagrams showing how these areas are laid out. On the left is the layout on the MIPS architecture (which we'll be studying in detail next week). On the right is the memory layout in early operating systems on the Intel 80x86 architecture.

Memory layout in the MIPS architecture (the text segment stores the code for the executing program); memory layout in early operating systems on the Intel 80x86 architecture. 5.2.3 (Brainstorm) Memory layout comparison The memory layout just described for Intel 80x86 systems is not as flexible as its MIPS counterpart. Give a reason why. 5.2.4 (Display page) A program to explore

Do the rest of today's exercises with a partner. As suggested earlier, the & operator can be used with any variable, even a function-valued variable, to find out where that variable is stored in memory. The program ~cs61cl/code/stgtest.c, reproduced below, contains objects of all four storage classes. Copy and compile the program, then run it using gdb. Set a breakpoint somewhere in the program and start the program. When it stops, use the print command and & with a variable to answer the questions in the next step.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>


char s[5];

char *str (int k) {
    char *rtn;
    int i;
    rtn = (char *) malloc (k);
    /* fill with placeholder characters and null-terminate */
    for (i = 0; i < k-1; i++) {
        rtn[i] = 'a' + i;
    }
    rtn[k-1] = '\0';
    return rtn;
}

int main ( ) {
    char s1[] = "abc";
    int a1[] = {3, 1, 4};
    char *s2 = "def";
    char *p1 = str(4);
    char *p2;
    int *a3;
    strcpy (s, p1);
    p2 = s1;
    a3 = a1;
    return 0;
}

5.2.5 (Brainstorm) Layout of storage classes The figure below (figure A.5.3 in Appendix A of the third edition of P&H or figure B.5.1 in Appendix B of the fourth edition) shows how memory is laid out.

Determine the order in which the four storage classes—system stack, heap (described as "dynamic data" in the figure from P&H), static data (global variables), and program text—appear in memory on the lab computers. 5.2.6 (Display page) What the system stack is used for What the system stack is used for

The system stack is a region of memory provided for each process. It is used mainly to deal with function calls. When a function is called, it reserves a block of memory on the stack, storing copies of arguments along with the address of the instruction to return to. Some of that memory is also used to store values of the function's local variables. When the function returns, the memory it reserved is released. 5.2.7 (Display page) An example An example

Consider the following recursive code, online in the file ~cs61cl/code/print+commas.c. The function printWithCommas prints its argument with commas inserted every third digit from the right, for example, printing 123000000 as 123,000,000.

void printWithCommas (unsigned int n) {
    if (n < 1000) {
        printf ("%u", n);
    } else {
        printHelper (n);
    }
    printf ("\n");
}

void printHelper (unsigned int n) {
    unsigned int last3;
    if (n < 1000) {
        printf ("%u", n);
    } else {
        last3 = n % 1000;
        printHelper (n / 1000);
        /* print comma, then last 3 digits with leading zeroes */
        printf (",%03u", last3);
    }
}
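The same recursion can be sketched as a variant that writes into a buffer instead of stdout, which makes its behavior easy to check (the withCommas and commasHelper names are mine):

```c
#include <stdio.h>
#include <string.h>

static void commasHelper (unsigned int n, char *buf) {
    if (n < 1000) {
        sprintf (buf + strlen (buf), "%u", n);
    } else {
        commasHelper (n / 1000, buf);
        /* comma, then last 3 digits with leading zeroes */
        sprintf (buf + strlen (buf), ",%03u", n % 1000);
    }
}

/* Format n with commas every third digit, e.g. 1234567 -> "1,234,567".
   Returns a static buffer, overwritten on each call. */
const char *withCommas (unsigned int n) {
    static char buf[16];   /* enough for 4,294,967,295 plus '\0' */
    buf[0] = '\0';
    commasHelper (n, buf);
    return buf;
}
```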

Suppose printWithCommas were called from main via the statement

printWithCommas (1234567);

At the point the base case of printHelper is reached, the stack contents are as shown in the diagram below.

stack entry for printHelper, including space for n = 1, last3, and the return address in printHelper
stack entry for printHelper, including space for n = 1234, last3, and the return address in printHelper
stack entry for printHelper, including space for n = 1234567, last3, and the return address in printWithCommas
stack entry for printWithCommas, including space for n = 1234567 and the return address in main
stack entry for main
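The frame-per-call picture above can also be checked programmatically; this sketch (mine) recurses a few times and verifies that each call's local variable gets a fresh stack address, comparing the two addresses while both frames are still live:

```c
#include <stddef.h>

/* Recurse n times; at each level compare the address of this
   call's local with the caller's. Returns 1 if every level got
   a distinct address. */
int framesDistinct (int n, int *callerLocal) {
    int local = n;        /* one local per activation */
    int deeperOk = 1;
    if (n > 0) deeperOk = framesDistinct (n - 1, &local);
    /* The caller's local and ours are both live here,
       so their addresses must differ. */
    if (callerLocal != NULL && callerLocal == &local) return 0;
    return deeperOk && local == n;
}
```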

With your partner, run the program now with gdb, set a breakpoint at the start of the printHelper function, and use the & operator and the print command to answer the questions in the following steps. 5.2.8 (Brainstorm) Exploring with gdb (1) Does the stack grow up, with the addresses of more recently pushed elements being greater than those of less recently pushed elements, or does it grow down in memory? Explain how you determined this. 5.2.9 (Brainstorm) Exploring with gdb (2) In hexadecimal, what is the size of the block of memory that printHelper allocates on the stack at each call? Explain how you figured that out. 5.2.10 (Brainstorm) Exploring with gdb (3) Provide some evidence that local variables in a C function are not initialized to any specific value. 5.2.11 (Display page) Exploring pointers into the stack Exploring pointers into the stack

Consider the following program, online in ptrIntoStack.c.

#include <stdio.h>

int *f (int varNotToSet, int varToSet) {
    int n = 7;
    varToSet = 7;
    return _____ ; // &n, &varNotToSet, or &varToSet could go here.
}

int fib (int n) {
    if (n == 0 || n == 1) {
        return n;
    } else {
        return fib(n-1) + fib(n-2);
    }
}

int main ( ) {
    int *ptr;
    int k1 = 7, k2 = 0, n;
    ptr = f (k1, k2);
    printf ("7 = ");
    printf ("%d\n", *ptr);
    return 0;
}

Again with a partner, run experiments to answer the following question: What choice for the blank ensures that "7 = 7" is printed?

a. &n, &varNotToSet, and &varToSet all work. b. Two of &n, &varNotToSet, and &varToSet work, but one doesn't. c. Exactly one of &n, &varNotToSet, and &varToSet works, but two don't. d. None of &n, &varNotToSet, and &varToSet are guaranteed to work.

As part of your experimenting, insert the call

n = fib (5);

immediately after the assignment to ptr. What difference does that modification make in the output, and why? Report your results in the next step. 5.2.12 (Brainstorm) Results of the experiments What choice for the blank ensures that "7 = 7" is printed?

a. &n, &varNotToSet, and &varToSet all work.
b. Two of &n, &varNotToSet, and &varToSet work, but one doesn't.
c. Exactly one of &n, &varNotToSet, and &varToSet works, but two don't.
d. None of &n, &varNotToSet, and &varToSet are guaranteed to work.

Explain how you determined the answer. 5.2.13 (Brainstorm) Exploring with * Just as you can use & to find out where a given object is located in memory, you can use the asterisk (*) operator to find out what is in memory at a given address. For example, the gdb command

print *(int *) 0xffbff6fc

will tell you what is in memory at address 0xffbff6fc. (The cast tells gdb to interpret the contents of that address as an int.) Find out if any of the memory locations between 0x10000 and 0xFFBFF800 is inaccessible to you, and report it in your answer. 5.3 Dynamically allocated linked lists(7 steps) 5.3.1 (Display page) Reading lines of a file into memory

Reading lines of a file into memory

In earlier activities we read a file and created an array of strings, one for each line, so that we could manipulate the information in the file as a data structure. An awkward aspect of that code was that we had to preallocate the array and we didn't know how big to make it. We didn't know how many lines were in the file until we read the whole file. We picked some arbitrary constant MAXLINES and if the file was bigger, oh well, bummer. It went something like this.

#define MAXLINES 100000
char *lines[MAXLINES];    /* Array of strings */

int main (int argc, char *argv[]) {
    char *s;
    FILE *infile = stdin;
    FILE *ofile = stdout;
    int lineNum = 0;
    int k;
    if (argc >= 3) ofile = fopen (argv[2], "w");
    if (argc >= 2) infile = fopen (argv[1], "r");
    s = newLine (infile);
    while (s && (lineNum < MAXLINES)) {
        lines[lineNum++] = s;
        s = newLine (infile);
    }
    for (k = 0; k < lineNum; k++) {
        fprintf (ofile, "%s\n", lines[k]);
    }
    return 0;
}

This implementation is flawed. If our guess for MAXLINES is too large, we waste memory. If it's too small, we lose data. 5.3.2 (Display page) Putting linked lists to work Putting linked lists to work

Let's instead build a "list of lines" and just grow it as we need to. Actually, we'll build a linked list of structs, each of which holds the characters in a line. Here's the declaration of the struct.

struct lineNode {
    char *line;
    struct lineNode *next;
};

Each lineNode holds a pointer to the text of its line in the line field. A lineNode also contains a pointer to a lineNode, called next. This is the link that forms the linked list. Note the apparently recursive type declaration; this works because C allows you to declare a pointer to a struct without knowing the struct's fields. We can declare a pointer to a line node as

struct lineNode *lnode;

We can allocate a node dynamically and assign a pointer to it.

lnode = (struct lineNode *) malloc(sizeof(struct lineNode));

There are two nice C idioms for walking down a list. Let's suppose lines is a pointer to the head of a list of lineNodes. (In Scheme it would be the list.) The while loop idiom is

p = lines;
while (p != NULL) {
    ... do something with the node pointed to by p ...
    p = p->next;
}

The for loop idiom is

for (p = lines; p != NULL; p = p->next) {
    ... do something with the node pointed to by p ...
}

The file in ~cs61cl/node/linelist.c shows a solution to the arbitrary line limit problem using a list of these structs. Copy it over, run it and study it. Each time it gets a new line, it adds the line to the head of a list. Then, it walks down the list to print it out. The problem is that the lines are printed in reverse order. Sigh. But we'll get back to that. 5.3.3 (Display page) A self-referencing node A self-referencing node

Supply the missing statements in the code below so that it creates the following structure. Don't worry about the contents of line.

#include <stdio.h>
#include <stdlib.h>

struct lineNode {
    char *line;
    struct lineNode *next;
};

int main ( ) {
    struct lineNode *p;

/* your code goes here */

    if (p != p->next) {
        printf ("ERROR: p != p->next\n");
    } else {
        printf ("SUCCESS\n");
    }
    return 0;
}

5.3.4 (Display page) List reversal List reversal

Returning to the list-processing program that prints the lines of a file in reverse order ... OK, just like in CS 61A we can reverse the list. The first statement of the loop body in the reverse function is provided:

struct lineNode *reverse (struct lineNode *p) {
    struct lineNode *res = NULL;    /* so-far reversed list */
    struct lineNode *nxt;           /* temporary variable */
    while (p) {
        nxt = p->next;

/* you complete this loop */

    }
    return res;
}

The code you add should transform the list structure at the start of the loop to the list structure shown below, at the start of the second iteration of the loop. At the end of the second iteration, res points to the reversed list, and nxt and p are null.

Then uncomment the line containing the call to reverse. The lines from the data file should now be printed in the order they occur in the file. 5.3.5 (Display page) Building a list in order Building a list in order

It would have been better to just add the lines to the end of the list as we went along. We could do this by running down to the end of the list every time we had something to add (feel free to write that version) or we could have kept two pointers: one to the head and one to the tail (the last node). Each time through the loop you might do something like this.

lnode = (struct lineNode *) malloc (sizeof (struct lineNode));
lnode->next = NULL;
if (head == NULL) head = lnode;
if (tail) tail->next = lnode;
tail = lnode;

Modify the main loop in linelist so that the list is built in order and you don't need reverse. 5.3.6 (Self Test) Practice with free Suppose now that we wish to free all the memory that we allocated to build the list of lines from the file. Identify which (if any) of the following program segments, when used as the body of the loop

for (p = lines; p != NULL; p = p->next) {

... } will successfully free all the allocated memory. The answer may be none or more than one of the choices. Assume that p and temp have been declared as struct lineNode *. Note also that it's illegal to use the contents of any memory that's been freed.

5.3.7 (Brainstorm) More practice with free Give code that correctly frees all the memory allocated for the lines list. 5.4 A few more pointer activities(4 steps) 5.4.1 (Brainstorm) Assigning to an ampersand expression? C doesn't allow the expression &n to appear on the left-hand side of an assignment statement. Give a possible reason for this restriction, and suggest what the statement

&n = p; would mean if it were legal. 5.4.2 (Brainstorm) Disambiguating "*p++" Students learning C are sometimes confused about how to interpret the expression *p++, where p is a pointer. There are two possibilities for what this expression could mean, depending on whether the ++ is applied before or after the pointer dereferencing: either *p++ means (*p)++ or *(p++). With your partner, complete the following main program to determine which interpretation of *p++ is correct. Don’t use any variables other than p and values.

int main ( ) {
    int *p;
    int values[2];
    // Add some code below.
    ______
    *p++;    // We ignore the value of the expression, and focus only on its side effects.
    // Fill in the following blank.
    if ( ______ ) {
        printf ("*p++ means *(p++)\n");
    } else {
        printf ("*p++ means (*p)++\n");
    }
}

5.4.3 (Display page) **p **p

We've seen the use of plain old variables (e.g. n). We've seen the use of pointer variables (e.g. *p). A variable declared as

type **p;

will contain a pointer to a pointer to a variable of the specified type. The diagram below displays an example.

int n = 0; int *p; int **ptr;

ptr p n +---+ +---+ +---+ | | | | | | | *-+------>| *-+------>| 0 | | | | | | | +---+ +---+ +---+ 5.4.4 (Self Test) How to initialize? We want to end up with the diagram below. Fill in the blank to complete the assignment to p.

p = ____ ; The diagram:

ptr p n +---+ +---+ +---+ | | | | | | | *-+------>| *-+------>| 0 | | | | | | | +---+ +---+ +---+

Fill in the blank to complete the assignment to ptr.

ptr = ____ ;

Does it matter what sequence the assignments are made in? E.g. can ptr be initialized before n?

5.5 Homework(1 step) 5.5.1 (Display page) Activities for next lab Activities for next lab

The next lab will be relatively light, reviewing structs, pointers and linked lists, and introducing assembly language. You should read P&H fourth edition sections 2.1, 2.2, 2.3, and 2.7, or P&H third edition sections 2.1, 2.2, 2.3, and 2.6 to prepare for lab. If you finish today's lab activities early, we suggest that you work on homework assignment 3 (now due on Monday evening, February 9). We also hope that you use the extra time in the next lab for work on homework assignment 3. 6 Introduction to assembly language 2009-2-05 ~ 2009-2-06 (2 activities)

6.1 Explore assembly language and the MIPS simulator.(7 steps) 6.1.1 (Display page) The instruction set architecture (ISA) The instruction set architecture (ISA)

The basic job of the computer's central processing unit (CPU) is to execute instructions, the primitive operations provided by the computer. Different CPUs implement different sets of instructions. The set of instructions a particular CPU implements is called its instruction set architecture (ISA). Examples are the Intel 80x86 (in the Pentium 4), the IBM/Motorola PowerPC (used until recently in Macintosh computers), and MIPS. In the earlier days of computing, the trend was to add more and more instructions to new CPUs to do elaborate operations. Digital Equipment's VAX computer (released in late 1977) had an instruction to multiply polynomials! The 1980s saw a change of direction toward "reduced" instruction sets; the RISC (Reduced Instruction Set Computer) philosophy had two components:

Keep the instruction set small and simple, thereby making it easier to make the hardware fast.
Rely on software to implement complicated operations by cleverly composing simple operations.

MIPS is a semiconductor company that built one of the first commercial RISC architectures. We will study the MIPS architecture in some detail in this class (it's also used in the upper-division courses CS 152, 162, and 164). Why use the MIPS instead of a more popular architecture? Two reasons:

It's simple and elegant, with few messy details.
It's widely used in embedded applications, unlike the x86 architecture. (Embedded applications include automobiles, printers, and toasters; there are more embedded computers than there are personal computers!)

6.1.2 (Display page) Registers Registers

Higher-level languages use variables for operands. At the ISA level, variables are represented by memory locations. However, the MIPS architecture does not allow direct arithmetic or logical operations on quantities stored in memory. Instead, it provides a limited number of registers, special locations built directly into the hardware, on which operations may be performed. Since registers are supplied directly in the hardware, they are very fast (access takes less than a billionth of a second). MIPS has 32 registers, each 32 bits (a word) wide. (Why 32? More registers would require more complicated hardware and more bits in the instruction.) Registers can be referred to by number (e.g. $3) or by name (e.g. $a0). A common processing pattern is to load information from memory into a register, do some computation on it, and then store it back into memory. Some examples appear below.

C statement: a = b;
    load b's contents into register $16;
    store register $16's contents into a;

C statement: a = b + 7;
    load b's contents into register $16;
    add 7 to the contents of register $16;
    store register $16's contents into a;

C statement: a = b + c;
    load b's contents into register $16;
    load c's contents into register $17;
    add the contents of registers $16 and $17, putting the sum into register $18;
    store register $18's contents into a;

Registers have names that suggest their conventional uses. For instance, registers $16 - $23 have names $s0 - $s7 ("s" stands for "saved"); these correspond roughly to C variables. Registers $8 - $15 have names $t0 - $t7; these correspond to temporary variables. We'll explain the names of the other sixteen registers later. In general, you should use the register names rather than their numbers to make your code more readable. Register $0, named $zero, always contains 0.
The designers of the MIPS architecture, after studying lots of code, concluded that initialization to and comparison with 0 is such a common operation that it should be made as simple as possible. 6.1.3 (Display page) Instruction format Instruction format

There are four kinds of instructions, each with its own format.

A lw ("load word") or sw ("store word") instruction has the form

opname register, memoryAddr

where opname is either lw or sw, register is a register name or number, and memoryAddr specifies a memory address from which to load a register or to which to copy a register. For now, the memory address will be the name of a location in memory. Arithmetic instructions each have an operation name (e.g. add or sub) and three operands: a destination register that will get the result, plus two source operands, the first of which must be a register name or number. Branch instructions have an operation name (one of the names listed below), two register operands, and a location to "branch" (jump) to if the register values are related in the specified way.

name    details
beq     branch if the two register values are equal
bne     branch if the two register values are not equal
blt     branch if the first register value is less than the second
ble     branch if the first register value is less than or equal to the second
bgt     branch if the first register value is greater than the second
bge     branch if the first register value is greater than or equal to the second

A jump instruction has the name j and a label saying where to jump to.

Assembly language lines that start with a dot (e.g. .sdata, .text) are directives to the assembler. They don’t generate actual machine instructions. Labels end with a colon. A label is a name for the address of the labeled thing. Comments start with a # and continue to the end of the line. 6.1.4 (Display page) Working with MARS, a MIPS interpreter Working with MARS, a MIPS interpreter

In order to run your assembly programs, you will use MARS, a simulator for the MIPS architecture. It only runs efficiently on the cory server. To run it in the lab, type

ssh -X cory to the shell, give your password, and then give the command

mars

If you forget to run ssh -X cory to log into cory.cs.berkeley.edu before running MARS, the GUI will not display properly. This seems to be a problem with the Java installation on nova. Assembly programs go in text files with a .s extension. The program must start with the label main: (similar to C) and end with a jr $ra. $ra, one of the registers, is set to the return address whenever a function is called, and main must return just like any other function. MARS also functions as a debugger. You can set breakpoints, step through instructions one by one, view what values are in your registers or in memory, as well as initialize values in registers before your program starts. Load an existing program by selecting Open from the File menu, then assemble it by selecting Assemble from the Run menu, and finally run the program by clicking the green arrow icon. MARS is written in Java, and may be downloaded from http://courses.missouristate.edu/KenVollmar/MARS/ After downloading, it should be double-clickable if you have a Java interpreter installed; run it from the command line by typing

java -jar Mars.jar 6.1.5 (Display page) MARS exercises MARS exercises

First, copy the file ~cs61cl/code/fib.s to your directory. Load it into MARS, assemble the code, and then run the program as described in the previous step. Assume that fib[0] = 0; fib[1] = 1; fib[n] = fib[n-1] + fib[n-2]. The answer is stored in register $a0, which you can view in the register display at the right-hand side of the MARS window. Experiment with MARS, using the fib.s program. Some things to try:

setting a breakpoint, and continuing after hitting a breakpoint;
determining the address at which a given variable is stored in memory;
changing the contents of a register;
changing the contents of a memory location;
using help (click the icon with the question mark) to find out what the sll instruction is;
using help to distinguish between the .word and the .space directives;
using help to learn about the syscall instruction.

Figure out what to change in the program to calculate the 13th Fibonacci number. The fib.s program has some noteworthy features.

It contains a text segment for the program, and a data segment for the variables. You set these up by saying

.text # your program goes here and

.data # your variables go here

The syscall instruction is the way to do various system calls on the MIPS. Here, it's used to exit the program. Online help in MARS reveals its other uses.

6.1.6 (Brainstorm) Decompiling assembly language to C Write a short C program that does essentially what the fib.s assembly language program does. (Note: the assembly program is iterative, not recursive like the definition of fib.) 6.1.7 (Display page) Programming exercises


Programming exercises

Do the following programming exercises.

1. Write a program with data labeled x and absx, that stores the absolute value of x in absx. One of the instructions in your program should be labeled allDone:. Set a breakpoint at that label so you can verify that absx contains the right thing.
2. Write a program with data labeled x, y, and max, that stores the larger of the two values in x and y into max. Supply an allDone label as in the previous exercise.
3. Write a program that, given data values labeled x, n, and power, computes the quantity x^n. The following instruction sequence gets the product of $s0 and $s1 into $s2:

mult $s0, $s1
mflo $s2

6.2 Homework and reading(3 steps) 6.2.1 (Display page) Upcoming material and homework summary Upcoming material and homework summary

Lab activities on Tuesday and Wednesday will explore assembly language equivalents of common C operations. Sections 2.7 and 2.14 (P&H fourth edition) or sections 2.6 and 2.15 (P&H third edition) will be useful background. The next two steps describe homework assignment 4. Expertiza will know them as "code commenting" and "adding machine". They are due at 11:59 on Friday evening, February 13, together with your peer evaluation of homework 3. Reminder: homework assignment 3 and peer evaluations for homework 2 are due Monday evening. 6.2.2 (Display page) Homework 4, problem 1 Homework 4, problem 1

The following code fragment, online in ~cs61cl/code/hw4q1.s, processes two arrays and produces an important value in register $v0. Assume that each array is indexed starting at 0, that the base addresses of the arrays are stored in $a0 and $a1 respectively, and their sizes (in words) are stored in $a2 and $a3, respectively. Add comments to the code that describe what the code is doing and what role each register is playing (a comment per line wouldn't be a bad ratio). Also describe in one sentence what this code does. Specifically, when control reaches the instruction labeled done, what will be in $v0?

        sll  $a2, $a2, 2
        sll  $a3, $a3, 2
        add  $v0, $zero, $zero
        add  $t0, $zero, $zero
outer:  add  $t4, $a0, $t0
        lw   $t4, 0($t4)
        add  $t1, $zero, $zero
inner:  add  $t3, $a1, $t1
        lw   $t3, 0($t3)
        bne  $t3, $t4, skip
        addi $v0, $v0, 1
skip:   addi $t1, $t1, 4
        bne  $t1, $a3, inner
        addi $t0, $t0, 4
        bne  $t0, $a2, outer
done:

Design and implement a MIPS assembly language version of the adding machine from homework assignment 1. The specifications are the same as for that assignment. Input to the program will represent integer values, typed one per line without leading or trailing white space. (Don't worry about incorrectly formatted input.) Input should be handled as follows:

a nonzero value should be added into a subtotal;
a zero value should cause the subtotal to be printed (in the form "subtotal" followed by a single space followed by the subtotal) and reset to zero;
two consecutive zeroes should cause the total of all values input to be printed (in the form "total" followed by a single space followed by the total) and the program to be terminated.

Your solution should be well commented—one comment per line would not be overkill—and should use white space appropriately to delimit logical sections of the program. You may use syscall to read an entire integer from the input; unlike in homework assignment 1, you don't need to build the integer out of its constituent digits. Also, MARS seems to echo user input, so it will appear interspersed with your subtotal/total messages. Print a newline character at the end of each of your output lines to improve the readability of the output. One last thing: only the $s registers are safe from being trashed by syscall. 7 Structs and Stuff 2009-2-05 ~ 2009-2-06 (3 activities) 7.2 Trying it out(6 steps) 7.2.1 (Brainstorm) Smart-er Strings Last week in lab we worked with a struct called smartString. For reference the struct definition is below.

struct smartString { char * chars; int length; };


We need a SMARTER string. We want to augment smartStrings so that we can have brilliantStrings that know whether the word they contain is a bad word.

struct brilliantString { struct smartString * smtString; int isBadWord; };

struct brilliantString *newBrilliantString (struct smartString *origSmartString, int isBadWord) { ... }

Fill in the blanks to write the procedure newBrilliantString. 7.2.2 (Brainstorm) Common Errors What are some problems you think people might have encountered writing a solution to the last question, newBrilliantString? What might they have forgotten or included unnecessarily? 7.2.3 (Brainstorm) Cons 1 Given the following struct, someone tried to write the procedure cons that would put a new node at the front of the list.

struct node{ char * data; struct node *next; };

/* return the result of "consing" an element to the front of the given list */
struct node *cons (char *newCar, struct node *list) {
    struct node newNode;
    newNode.data = newCar;
    newNode.next = list;
    return &newNode;
}

What would happen if you tried to compile and run this code? Please be as specific as possible; however, you should not actually compile the program. 7.2.4 (Brainstorm) Cons 2 Given the following struct, someone tried to write the procedure cons that would put a new node at the front of the list.

struct node{ char * data; struct node *next; };

/* return the result of "consing" an element to the front of the given list */
struct node *cons (char *newCar, struct node *list) {
    struct node *newNode;
    newNode->data = newCar;
    newNode->next = list;
    return newNode;
}

What would happen if you tried to compile and run this code? Please be as specific as possible; however, you should not actually compile the program. 7.2.5 (Brainstorm) Cons 3 Given the following struct, someone tried to write the procedure cons that would put a new node at the front of the list.

struct node{ char * data; struct node *next; };

/* return the result of "consing" an element to the front of the given list */
struct node *cons (char *newCar, struct node *list) {
    struct node *newNode = (struct node *) malloc (sizeof (struct node));
    newNode->data = newCar;
    newNode->next = list;
    return newNode;
}

What would happen if you tried to compile and run this code? Please be as specific as possible; however, you should not actually compile the program. 7.2.6 (Brainstorm) Cons 4 Given the following struct, someone tried to write the procedure cons that would put a new node at the front of the list.

struct node{ char * data; struct node *next; };

/* return the result of "consing" an element to the front of the given list */
struct node *cons (char *newCar, struct node *list) {
    struct node *newNode = (struct node *) malloc (sizeof (struct node));
    newNode->data = (char *) malloc ( (sizeof(char)) * strlen(newCar) );
    newNode->next = list;
    return newNode;
}

What would happen if you tried to compile and run this code? Please be as specific as possible; however, you should not actually compile the program.

http://spring09.ucwise.org/builder/builderPortal.php?BUILDER_menu=curriculumSummary Page 37 of 194 Curriculum Builder Portal 2/11/10 8:55 AM

8 Structs and Stuff-A 2009-2-05 ~ 2009-2-06 (3 activities) 8.2 Trying it out(6 steps) 8.2.1 (Brainstorm) Smart-er Strings Last week in lab we worked with a struct called smartString. For reference the struct definition is below.

struct smartString { char * chars; int length; };

We need a SMARTER string. We want to augment smartStrings so that can have brilliantStrings that know if the word they contain is a bad word.

struct brilliantString { struct smartString * smtString; int isBadWord; };

struct brilliantString * newBrillaintString (struct smartString* origSmartString, int isBadWord) { ... } Fill in the blanks to write the procedure newBrilliantString. 8.2.2 (Brainstorm) Common Errors What are some problems you think people might have encountered writing a solution to the last question, newSmartString? What might they have forgotten or included unnecessarily? 8.2.3 (Brainstorm) Cons 1 Given the following struct, someone tried to write the procedure cons that would put a new node at the front of the list.

struct node{ char * data; struct node *next; };

/* return the result of "consing" an element to the front of the given list */ struct node *cons (char *newCar, struct node *list) { struct node newNode; newNode.data = newCar; newNode.next = list; return &newNode; } What would happen if you tried to compile and run this code? Please be as specific as possible, however you should not actually compile the program. 8.2.4 (Brainstorm) Cons 2 Given the following struct, someone tried to write the procedure cons that would put a new node at the front of the list.

struct node{ char * data; struct node *next; };

/* return the result of "consing" an element to the front of the given list */ struct node *cons (char *newCar, struct node *list) { struct node *newNode; newNode->data = newCar; newNode->next = list; return newNode; } What would happen if you tried to compile and run this code? Please be as specific as possible, however you should not actually compile the program. 8.2.5 (Brainstorm) Cons 3 Given the following struct, someone tried to write the procedure cons that would put a new node at the front of the list.

struct node{ char * data; struct node *next; };

/* return the result of "consing" an element to the front of the given list */ struct node *cons (char *newCar, struct node *list) { struct node *newNode= (struct node *) malloc (sizeof (struct node)); newNode->data = newCar; newNode->next = list; return newNode; } What would happen if you tried to compile and run this code? Please be as specific as possible, however you should not actually compile the program. 8.2.6 (Brainstorm) Cons 4 Given the following struct, someone tried to write the procedure cons that would put a new node at the front of the list.

struct node{ char * data; struct node *next; }; http://spring09.ucwise.org/builder/builderPortal.php?BUILDER_menu=curriculumSummary Page 38 of 194 Curriculum Builder Portal 2/11/10 8:55 AM

/* return the result of "consing" an element to the front of the given list */ struct node *cons (char *newCar, struct node *list) { struct node *newNode= (struct node *) malloc (sizeof (struct node)); newNode->data = (char *) malloc ( (sizeof(char)) * strlen(newCar) ); newNode->next = list; return newNode; } What would happen if you tried to compile and run this code? Please be as specific as possible, however you should not actually compile the program. 9 Structs and Stuff-1 2009-2-05 ~ 2009-2-06 (3 activities) 9.2 Trying it out(6 steps) 9.2.1 (Brainstorm) Smart-er Strings Last week in lab we worked with a struct called smartString. For reference the struct definition is below.

struct smartString { char * chars; int length; };

We need a SMARTER string. We want to augment smartStrings so that can have brilliantStrings that know if the word they contain is a bad word.

struct brilliantString { struct smartString * smtString; int isBadWord; };

struct brilliantString * newBrillaintString (struct smartString* origSmartString, int isBadWord) { ... } Fill in the blanks to write the procedure newBrilliantString. 9.2.2 (Brainstorm) Common Errors What are some problems you think people might have encountered writing a solution to the last question, newSmartString? What might they have forgotten or included unnecessarily? 9.2.3 (Brainstorm) Cons 1 Given the following struct, someone tried to write the procedure cons that would put a new node at the front of the list.

struct node{ char * data; struct node *next; };

/* return the result of "consing" an element to the front of the given list */ struct node *cons (char *newCar, struct node *list) { struct node newNode; newNode.data = newCar; newNode.next = list; return &newNode; } What would happen if you tried to compile and run this code? Please be as specific as possible, however you should not actually compile the program. 9.2.4 (Brainstorm) Cons 2 Given the following struct, someone tried to write the procedure cons that would put a new node at the front of the list.

struct node{ char * data; struct node *next; };

/* return the result of "consing" an element to the front of the given list */ struct node *cons (char *newCar, struct node *list) { struct node *newNode; newNode->data = newCar; newNode->next = list; return newNode; } What would happen if you tried to compile and run this code? Please be as specific as possible, however you should not actually compile the program. 9.2.5 (Brainstorm) Cons 3 Given the following struct, someone tried to write the procedure cons that would put a new node at the front of the list.

struct node{ char * data; struct node *next; };

/* return the result of "consing" an element to the front of the given list */ struct node *cons (char *newCar, struct node *list) { struct node *newNode= (struct node *) malloc (sizeof (struct node)); newNode->data = newCar; newNode->next = list; return newNode; } http://spring09.ucwise.org/builder/builderPortal.php?BUILDER_menu=curriculumSummary Page 39 of 194 Curriculum Builder Portal 2/11/10 8:55 AM

What would happen if you tried to compile and run this code? Please be as specific as possible, however you should not actually compile the program. 9.2.6 (Brainstorm) Cons 4 Given the following struct, someone tried to write the procedure cons that would put a new node at the front of the list.

struct node{ char * data; struct node *next; };

/* return the result of "consing" an element to the front of the given list */
struct node *cons (char *newCar, struct node *list) {
    struct node *newNode = (struct node *) malloc (sizeof (struct node));
    newNode->data = (char *) malloc ( (sizeof(char)) * strlen(newCar) );
    newNode->next = list;
    return newNode;
}

What would happen if you tried to compile and run this code? Please be as specific as possible, but do not actually compile the program. 10 MIPS equivalents of C constructs 2009-2-10 ~ 2009-2-11 (5 activities) 10.2 Explore ways of filling registers.(6 steps) 10.2.1 (Display page) Access to registers and memory

As we noted earlier, all computation must be done in registers, quickly accessible storage supplied along with the central processing unit. There are 32 registers, each holding a word (32 bits). Copies between registers and memory are done by loads, which copy memory values into registers, and stores, which copy register values back to memory, as portrayed in the diagram below.

One of the registers, $0, is read-only. It always contains 0. 10.2.2 (Display page) The lw and sw instructions

The lw ("Load Word") instruction copies a word of memory into a register. Its counterpart, sw ("Store Word"), copies a register value into a location. The simplest forms of these instructions are

lw destination_register, source_label
sw source_register, destination_label

where register is the name or number of a register, and label names a location in memory. The label should appear as the first thing on a line of an assembly language program, followed by a colon (":"). More generally, we can use another register to tell us where in memory to copy from. The syntax of this more general form of lw is as follows:

lw destination, byte_offset(register) where destination is the destination register, register is a register containing a pointer to memory, and byte_offset is a numerical offset in bytes. The byte offset plus the pointer in the register is called the effective address. For example, if $t0 contains a pointer into memory, then the instruction lw $t1,8($t0) would be executed by first computing the effective address—8 plus whatever is in $t0—then copying whatever is at that location into register $t1. A label is a different kind of offset; it represents a memory location all on its own. Thus the instruction lw $t0,count is the same as lw $t0,count($0) (recall that $0 always contains 0). The byte offset is most useful when the source register contains a pointer to an array or structure. If $t0 contains the address of the first element of an array of int named values, then the instruction lw $t1,4($t0) copies values[1] into $t1. Similarly, if $t0 contains a pointer to a struct point declared as

struct point {

    int x;
    int y;
};

then lw $t1,0($t0) copies x's value into $t1, and lw $t1,4($t0) does the same thing with y's value. 10.2.3 (Brainstorm) Translation of *x = *y Give a MIPS program segment that implements the C statement

*x = *y;

Assume that register $s0 contains the x value and register $s1 contains the y value. 10.2.4 (Display page) Loading other types of information

Types in C are associated with declarations. At the assembly language level, however, all memory is typeless; for example, the way to interpret a given value as a character is to use a different kind of load instruction with it. An example is the lb instruction, which copies the byte (here, an ASCII character value) at the given effective address into the given register. The corresponding store instruction is sb. The effective address for an lw or sw must be divisible by 4, i.e., aligned to a word boundary. Verify this by running the following program in MARS:

.data
n:  .word 0
c1: .byte 0
c2: .byte 1
.text
main: lw $t0,c2

10.2.5 (Brainstorm) Another alignment question Suppose that $t0 contains a pointer to somewhere in the data segment. Does the instruction lw $t1,3($t0)

a. always crash, or does it b. crash only sometimes?

Briefly explain. 10.2.6 (Display page) Two other kinds of load

The li ("Load Immediate") instruction is used to store an integer value into a register without accessing memory. Its first argument is the register to load; its second is the integer value to put there. The la ("Load Address") instruction stores the address of its second argument into the register named by its first argument. It corresponds to the & operator in C. The following instruction sequences all do the same thing, namely, they implement the statement b = a+5. Assume that labels a and b have been associated with memory locations in the data segment.

Sequence 1:
lw $t0,a
li $t1,5
add $t0,$t0,$t1
sw $t0,b

Sequence 2:
lw $t0,a
addi $t0,$t0,5
sw $t0,b

Sequence 3:
la $t0,a
lw $t0,0($t0)
addi $t0,$t0,5
sw $t0,b

10.3 Explore memory(7 steps) 10.3.1 (Display page) Variables and memory

Here is a simple C program that uses global and local variables.

#include <stdio.h>

int x = 7; int y = 9; int n;

int main () {
    int t0 = 1;
    n = (x+y) - ((x-t0) + (t0-y)) + x;
    printf("n = %d\n", n);
}

Paste it into a file called temps1.c. Compile and run it so that you know what it does. Let's convert this into a C program that is essentially at machine language level. Create a new file temps2.c by introducing additional temporary local variables that hold the individual subexpressions of

(x+y) - ((x-t0) + (t0-y)) + x For example,

int t1 = x+y;

so that you have a longer C program that has at most one addition or subtraction on a line. Do not simplify the expression. Do exactly the same computation, but one step at a time. You may reuse temporary variables when you no longer need the old value, such as t3 = t3 + x. Compile and run your program to make sure you didn't break it. Feel free to fiddle with the initial values of the variables. 10.3.2 (Display page) Variables to registers

Now convert your little C program to something even closer to machine language. Before you can use x or y, you must first assign it to a temporary variable. And finally, you will assign the temporary result to n. So each line may do one of

read a global variable, write a global variable, or do one arithmetic operation,

but only one of these. Put this in a new file temps3.c. It will look something like this.

int x = 7; int y = 9; int n;

int main () {
    int t0 = 1;
    //set n = (x+y) - ((x-t0) + (t0-y)) + x;
    int t1 = x;
    ...
    int t2 = t1-t0;
    ...
    n = t3;
    printf("n = %d\n", n);
}

If you squint a little bit, you can think of these temporary variables like registers—special fast storage in the machine that is used to hold temporary values while you are working on them. 10.3.3 (Self Test) Using load and store Among the kinds of storage you've seen are local variables that are in the stack frame of the function. In addition, you used malloc to allocate storage from the heap and then used a variable to hold a pointer to it. Let's speculate what must be going on at the machine level. Suppose register $sp contains the address of the top of the stack. What instruction would copy the top stack element into register $t0?

Suppose register $v0 contains an address returned by malloc. What instruction would copy register $t0 into the first word of the allocated storage?

10.3.4 (Display page) Converting from ASCII digit to number

Assume that the two bytes at addresses x and x+1 are ASCII digits. Supply the load instructions that implement the comments in the code below, and test your solution in MARS.

.data
x: .byte '3', '4'
.text
# load the byte at the location labeled x into $t0
# load the next byte into $t1
# load the character '0' into $t2
sub $t0,$t0,$t2
sub $t1,$t1,$t2
allDone:

10.3.5 (Brainstorm) Initializing an array element Fill in each blank to implement the C statement a[3] = x, where x and the elements of a are all 4-byte quantities. Also, briefly explain your answer.

____ $t0,a
____ $t1,x
sw $t1,12($t0)

10.3.6 (Display page) Common pitfalls

Perhaps because of the confusion between the contents of a memory cell and the address of that cell, assembly language programmers often run into trouble distinguishing a pointer from what it's pointing to. (This problem shows up in C as well.) Keep the following in mind:

If you see an operand like 8($t0) in a lw or sw instruction, you should make sure that $t0 contains a valid address. More generally, a load is processed by first computing the effective address. For 8($t0), that's the contents of $t0 plus 8. Then the contents of the addressed memory cell are retrieved.

One more pitfall is forgetting that sequential word addresses don't differ by 1. If $t0 contains an address of a 4-byte array element, then the address of the next array element is 4($t0), not 1($t0). 10.3.7 (Display page) Pseudo-instructions

You may observe that MARS's idea of an instruction doesn't always match your own. For example, when you use the assembly language instruction lw $t0,values, MARS converts it into the pair of instructions

lui $1,some_value
lw $t0,some_offset($1)

(lui = "Load Upper Immediate"; it copies the operand into the top 16 bits of the register.) This is

because the version of lw that uses a label is not built into the MIPS instruction set; it's a pseudo-instruction that has to be translated by the assembler into code that does consist of native instructions. Pseudo-instructions represent a compromise between convenience for the programmer and simplicity of the computer architecture. Interestingly, the MIPS system designers did not provide a "subtract immediate" instruction, instead requiring programmers to use addi (with a negative operand). Another example of a pseudo-instruction is la ("load address"), which for some arguments is translated into a lui and an ori ("OR Immediate"). You don't have to worry yet about which instructions are pseudo-instructions and which are native, or how the translation between the two is done; we'll discuss this next week when we cover instruction formats. You should know, however, why the MIPS help facility in MARS contains a section for pseudo-instructions as well as a section for basic instructions. 10.4 Examine MIPS implementation of more advanced C constructs.(11 steps) 10.4.1 (Display page) Register use in expression evaluation

We've seen the use of registers to evaluate simple expressions. For example, here is how to implement the C statement a = b + c + d - e, where the variables are all of type int:

lw $t0,b          # running total
lw $t1,c
add $t0,$t0,$t1
lw $t1,d
add $t0,$t0,$t1
lw $t1,e
sub $t0,$t0,$t1
sw $t0,a

One can easily devise statements that require more than two registers to implement, for example, the statement f = (g+h) - (i+j):

lw $t0,g
lw $t1,h
add $t0,$t0,$t1   # compute temp1 = g + h
lw $t1,i
lw $t2,j
add $t1,$t1,$t2   # compute temp2 = i + j
sub $t0,$t0,$t1   # compute f = temp1 - temp2
sw $t0,f

A good compiler tries to keep the most frequently used variables in registers to avoid reloads. The less common variables are spilled to memory. 10.4.2 (Display page) Other features of MARS

Transform your program temps3.c essentially line by line into an assembly language file temps.s. It will not print the result. We'll use MARS to see the value in n at the end. Here's a starting point.

.data
x: .word 7
y: .word 9
n: .word 0

.text
main:   li $t0, 1
        lw $t5, x
        lw $t6, y
        add $t1, $t5, $t6
        ...
finish: sw $t1, n
        li $v0, 10
        syscall

Complete this program so that it mimics temps3.c. Each temporary variable is replaced by a register. Arithmetic operators become operation codes. Note the use of load, store, and load immediate. Load your assembly language program into MARS. Assemble it. Single-step it to the end and verify that it does what your C program does. Try it on other inputs by modifying the values at the memory locations for x and y. Rerun the program and notice the special register pc that is almost at the bottom. What does it contain before the program takes its first step? After you step it once? At each step? PC stands for Program Counter (long before the other PC). MARS prints the binary machine language instruction for each assembly language instruction. Identify the op codes for the various instructions, and check your guess with P&H section 2.5 (fourth edition) or section 2.4 (third edition). 10.4.3 (Display page) Branches and loops

The fundamental operation of a computer is the instruction cycle. It goes like this

1. Fetch the instruction at the address in the PC from memory.
2. Decode that instruction to determine what to do (the operation) and what to do it on (the operands).
3. Fetch the operands.
4. Execute the operation on the operand values.
5. Store the result.
6. Update the PC to be the address of the next instruction.

That's it. Big or small, fast or slow. All computers just do the instruction cycle over and over and over. In the MIPS, instructions are simple. Each does essentially one thing. So far, we have seen instructions that load from memory, store to memory, or do arithmetic on registers. Note that lw and

sw also do some arithmetic to compute the effective address. There are lots more instructions to do logical operations, make comparisons, and do shifts and masks. All the operators in chapter 2 of K&R have corresponding machine instructions. Furthermore, all of these in the MIPS update the PC in a simple way: adding 4 so that it points to the next 32-bit instruction. (This is another beautiful thing about the MIPS. All the instructions are the same size. On your x86 they may be as small as a single byte or over a dozen, with numerous possibilities in between!) There is another class of instructions whose whole purpose is to determine the next instruction, i.e., to update the PC. They change the flow of control. Instead of the program executing in a straight line, it may jump to a distant place or branch, depending on some value computed along the way. Not surprisingly, we call these "jump" and "branch" instructions. A jump instruction is an unconditional change in the flow of control. Here's an example.

.data
x: .word 7
y: .word 99
n: .word 0

.text
main: lw $t0, x
      lw $t5, y
loop: sub $t5, $t5, $t0
      j loop

finish: sw $t1, n    # never get here
        li $v0, 10
        syscall

Cut and paste it into a file and single-step it in MARS. Notice what happens when you hit the jump. Where does it jump to? Can you find the target address in the jump instruction? The label has been turned into an address. Check out what happens to the PC. Put some extra instructions in, or maybe some more labels and jumps. MIPS also has several conditional branch instructions. These decide whether the next instruction to execute is the one that follows consecutively in memory (PC+4, just like straight-line code) or one at another target address.

name  details
beq   branch if the two register values are equal
bne   branch if the two register values are not equal
blt   branch if the first register value is less than the second
ble   branch if the first register value is less than or equal to the second
bgt   branch if the first register value is greater than the second
bge   branch if the first register value is greater than or equal to the second

Change the jump instruction in the program above to be

bge $t5 $0 loop

Now what does this do when you step through the program? 10.4.4 (Brainstorm) Playing compiler How would the following C fragment be implemented at the machine level?

while (i < n) { ... do something; i++; } What about the following?

if (i < n) { ... do something; i++; } 10.4.5 (Display page) More playing compiler

Loops, conditionals, and switches are all implemented with jumps and branches. Some examples appear below. The branch and jump instructions can be used to implement an if-then-else as shown here.

C statement:
    if (condition)
        clause1;
    else
        clause2;

rewritten with gotos:
    if (condition) goto L1;
    clause2;
    goto L2;
L1: clause1;
L2:

MIPS translation:
    branch args,L1
    translation of clause2
    j L2
L1: translation of clause1
L2:

The branches described above all do signed comparisons. For unsigned comparisons, we use a "Set on Less Than" instruction:

slt dest_reg,reg1,reg2: sets dest_reg to 1 if reg1 is less than reg2, and to 0 otherwise.
slti dest_reg,reg,int: sets dest_reg to 1 if reg is less than int, and to 0 otherwise.
sltu dest_reg,reg1,reg2: sets dest_reg to 1 if reg1 is less than reg2, doing an unsigned comparison, and to 0 otherwise.
sltiu dest_reg,reg,int: sets dest_reg to 1 if reg is less than int, doing an unsigned comparison, and to 0 otherwise.

The branch instructions blt, ble, bgt, and bge are really pseudo-instructions, implemented by the assembler as an slt followed by a beq or bne. 10.4.6 (Brainstorm) Code analysis Here is some code.

add $t0,$0,$0
loop:   beq $a1,$0,finish
        add $t0,$t0,$a0
        addi $a1,$a1,-1
        j loop
finish: add $v0,$t0,$0

Assume that $a0 and $a1 initially contain the integers a and b. In terms of a and b, what does the code leave in $v0? 10.4.7 (Brainstorm) Code optimization Here is the code from the previous step.

add $t0,$0,$0
loop:   beq $a1,$0,finish
        add $t0,$t0,$a0
        addi $a1,$a1,-1
        j loop
finish: add $v0,$t0,$0

Observe that the loop contains four instructions. Restructure the code so that it computes the same result but with a loop that contains only three instructions. (You will need to add code outside the loop.) 10.4.8 (Brainstorm) Code debugging The following program tries to copy words from the address in $a0 to the address in $a1, counting the number of words copied in $v0. The program should stop copying when it finds a word equal to 0; the terminating word should be copied but not counted. The program will not preserve the contents of registers $v1, $a0, or $a1.

addi $v0,$0,0           # initialize count
loop:   lw $v1,0($a0)   # get next word from source
        sw $v1,0($a1)   # copy to destination
        addi $a0,$a0,4  # advance source pointer
        addi $a1,$a1,4  # advance destination pointer
        beq $v1,$0,loop

There are several bugs in this program segment. Find them and describe how to fix them. 10.4.9 (Brainstorm) A bounds-check short cut Suppose you want to check if $t0 contains a legal array index for an array indexed from 0 to the value in $t1 minus 1. The following code does this check.

sltu $t2,$t0,$t1
beq $t2,$0,indexOutOfBounds

Explain why this works. 10.4.10 (Display page) Setting up an array in the data segment

The .word directive can take several arguments, thus allowing the initialization of a simple simulated array. An example: to implement the C statement

int values [ ] = {3, 1, 4, 5, 9}; we can say

values: .word 3, 1, 4, 5, 9

The .byte directive works the same way. Similar mechanisms are provided for initializing strings:

.asciiz string_enclosed_in_double_quotes
.ascii  string_enclosed_in_double_quotes

Both set up a sequence of characters. The .asciiz directive adds a null character to the end of the sequence. You will notice in the MARS display that characters are stored right to left within a word; e.g. "mike" is stored as

0x656b696d ('m' = 0x6d, 'i' = 0x69, 'k' = 0x6b, and 'e' = 0x65.) This is because bytes within a word in MARS are addressed from right to left; such machines are called little-endian. On some computers, the byte addressing is reversed (big-endian). Patterson and Hennessy discuss the difference in section 2.3, page 84 (fourth edition) or page 56 (third edition). 10.4.11 (Display page) Summarizing

Let's put together what you've seen so far in a simple example. A simple C program like

int A[] = {1,2,3,4,5};
int n = 0;

int main() {
    int i;
    int sum = 0;
    for (i=0; i < 5; i++)
        sum += A[i];
    n = sum;
}

might translate into this.

.data
A: .word 1, 2, 3, 4, 5
n: .word 0
.text
main:   and $s0, $zero, $zero
        la $s1, A
        li $t0, 5
loop:   ble $t0, 0, finish
        lw $t1, 0($s1)
        add $s0, $s0, $t1
        sub $t0, $t0, 1
        add $s1, $s1, 4
        j loop

finish: sw $s0, n
        li $v0, 10
        syscall

Cut and paste this into a file. Load it into MARS and assemble it. Step through it and notice what happens to the registers at every step. How is the array reference implemented? Rewrite the C to do what the assembly actually does. 10.5 Homework and readings(2 steps) 10.5.1 (Display page) Homework and readings

Next lab will be an introduction to procedure calls in MIPS assembly language. As background, you should read Patterson and Hennessy, section 2.8 (fourth edition) or section 2.7 (third edition). By the start of your next lab, post a question about the project 1 specification in the following discussion step. The question should be one that either you have or you suspect that one of your classmates may have. (It shouldn't match any of the previously posted questions.) Then, in lab on Thursday or Friday, post an answer to one of the questions. In the upcoming Tuesday/Wednesday lab, you will be expected to show progress on project 1. Reminder: homework assignment 4 and peer evaluations of homework assignment 3 are due on Friday evening. 10.5.2 (Discussion Forum) Q/A on project 1 By the start of your lab on Thursday/Friday, post a question about project 1. In that lab, answer one of the questions posted by your labmates. 11 MIPS procedures 2009-2-12 ~ 2009-2-13 (3 activities) 11.1 Contribution to project 1 discussion(1 step) 11.1.1 (Display page) Responding to a question about project 1

Reminder: you're supposed to answer one of your labmates' questions about project 1 today in lab. Go back to the discussion step at the end of the Tuesday/Wednesday lab activities to do so. Don't answer a question that's already been answered unless you can add something significant to the existing answer. 11.2 MIPS conventions for calling and returning from functions(13 steps) 11.2.1 (Display page) Calling and returning from a very simple function

There are three components to a MIPS function call.

Linkage: how to jump to the start of the function, and how to get back to the caller when the function finishes.
Communication: how to pass arguments and return values in registers.
Use of the stack.

We'll start by examining the simplest possible function, one that doesn't do anything at all but return.

What the caller does:   call: jal lazyfunct
What the callee does:   lazyfunct: jr $ra

The jal instruction ("Jump And Link") does two things. First, it places the address of the instruction that immediately follows the jal into register $ra ("Return Address"; it's register $31); that's the link. Then it jumps to the named location. Return from the function is done with the jr instruction. It jumps to the address in the specified register—in this case, the address that we just put into $ra via the jal. 11.2.2 (Display page) Jumping there and back

Let's try to build up to call-and-return of a function with what we have so far. You got familiar with Jumps and Branches (conditional jumps) in the previous lab. A subroutine or function is a set of

instructions that accomplishes some useful task and might be called from several points in a program. First, let's try it with a direct jump in MARS. Here's an example MIPS program. Paste it into a file, load it into MARS, assemble it, and step through it. The idea here is that decr is a function that decrements the value that is in register $a0, putting its result in register $v0. Pretty simple, but you can imagine doing more interesting things. Jumping to decr is the call. How does it get back?

.data
x: .word 5
y: .word 0

.text
main: lw $a0, x
      j decr
back: move $a0,$v0

finish: sw $a0, y
        li $v0, 10
        syscall

decr: addi $v0, $a0, -1
      j back

Modify the program so that it "calls" decr in a loop until the value is decremented to zero. That is, something like

loop: j decr
back: move $a0,$v0
      bne $a0,$zero, loop

Step through it and see how the call/return appears to work. See how the jump modifies the PC. OK, now let's break it. Modify the original program so that it "calls" decr twice in sequence. What happens when the second one tries to "return"? In general, a function can be called from lots of places in a program. How do you think a function returns to where it is called? Clearly it can't just jump directly back. 11.2.3 (Display page) Call and return

They say all things in computer science are solved by a level of indirection. That is the answer to the call-return mystery too. For the function to return to wherever it was called requires an indirect jump. Before we can do that, however, we need to get the address we want to jump to into a register. To do this, we'll use a special kind of jump instruction that records where it came from. In MIPS, this is called jump-and-link, jal. Here's an example.

.data
x: .word 5
y: .word 0
.text
main: lw $a0, x
      jal decr
      move $a0,$v0
      jal decr
      move $a0,$v0
      jal decr
      move $a0,$v0
      jal decr
      move $a0,$v0

finish: sw $a0, y
        li $v0, 10
        syscall

decr: addi $v0, $a0, -1
      jr $ra

Cut-and-paste it into a file and step through it with MARS. It calls decr four times. Notice what $ra contains when the addi is executed. What happened to the back label? The jr is the indirect jump. It jumps to the address specified in a register. Yes, this is essentially a pointer to code. jal creates the pointer (or link). jr uses it. 11.2.4 (Self Test) Simulating jal Fill in the blank so that the effect of the given code segment matches the behavior of the jal callAddr instruction.

___ $ra,rtnAddr
j callAddr
rtnAddr:

If pc holds the address of a jal instruction, what value gets stored in $ra as a result of executing the jal?

11.2.5 (Display page) Summation, revisited


In the previous lab, the step named "Summarizing" gave you some code that summed the elements of an integer array. Now that you know how to create functions in assembly language, revise that code so that the main routine calls a function sumarray, passing it the array (i.e., the address of the array's first element) and its length n. It should return the sum of the first n array elements. 11.2.6 (Display page) A slightly more complicated function

We move on to consider a function that takes arguments and returns a value, namely the diff function that returns the absolute value of the difference of its arguments. Registers are used to communicate between the caller and the callee:

What the caller does:
    lw $a0, value0
    lw $a1, value1
    jal diff

What the callee does:
    diff: sub $v0,$a1,$a0
          bgtz $v0, pos
          sub $v0, $zero,$v0
    pos:  jr $ra

In general, arguments get passed in $a0, $a1, $a2, and $a3. The register $v0, sometimes along with $v1, is used to store the return value. ($v0 and $v1 together are used to return a 64-bit quantity.) 11.2.7 (Display page) A more complicated function

We move on to a function that computes the "manhattan distance" between two points (x0,y0) and (x1,y1): the sum |x1-x0| + |y1-y0|. Pseudocode to perform this computation appears below.

1. Assume x0, y0, x1, and y1 are passed in $a0, $a1, $a2, and $a3 respectively.
2. Call diff to compute |x1-x0|.
3. Call diff again to compute |y1-y0|.
4. Add the two distances.
5. Return the resulting value in $v0.

Think a little about what the assembly code for this would look like. We'll develop it as we go along. Part of the trick is that we have a limited set of registers and they are to be used for specific purposes, so we need to move things around. 11.2.8 (Brainstorm) A buggy implementation Let's take a simple cut at a function that computes the distance by calling diff on x0, x1 and then y0, y1. What is wrong with the following MIPS function?

dist: add $s1,$a1,$0   # save y0
      move $a1, $a2    # move x1 for call to diff
      jal diff         # call diff
      move $a0, $s1    # get y0
      move $a1, $a3    # move y1 for call to diff
      jal diff
      add $v0,$v0,$s3  # abs(y1-y0) + abs(x1-x0)
      jr $ra

There are two bugs in the code. Identify them, and explain how you found them. 11.2.9 (Display page) Saving information between function calls

Complications arise in the processing of a function that calls another. For instance, in the distance calculation, we have

# Put x0 and x1 into $a0 and $a1.
jal diff
# Put y0 and y1 into $a0 and $a1.
jal diff

Immediately after returning from the first call to diff, we need to save $v0 somewhere, since the second call to diff is going to overwrite it. Moreover, the jal instructions will modify $ra, so we have to save its original value in order to be able to return to the caller of the distance function. The need for temporary registers in a function is quite common, so conventions have been set up for register use. There is a set of registers that the callee is free to change; the caller must assume that they will be trashed and not leave anything of value in them across the call.

$t0, $t1, ..., $t9 ("t" is for "temporary") may be changed by a callee.
$a0, $a1, $a2, and $a3 may also be changed by a callee.
$s0, $s1, ..., $s7 ("s" is for "saved" or "secure") may not be changed by a callee. If the callee changes them, it has to restore their original values before it returns.

Here's some almost-complete code for the distance function.

dist: # code to save $s0-$s3 goes here
      move $s2,$ra     # save return address

      move $s0,$a1     # save y0
      move $s1,$a3     # save y1

      move $a1,$a2     # set up args for call
      jal diff
      move $s3,$v0     # save abs(x1-x0)
      move $a0,$s0     # set up y0
      move $a1,$s1     # and y1
      jal diff
      add $v0,$v0,$s3  # abs(y1-y0) + abs(x1-x0)

move $ra,$s2

      # code to restore $s0-$s3 goes here
      jr $ra           # return

11.2.10 (Display page) Stack management

In learning C, we learned about keeping local variables on the stack. We have just seen that at the machine level we can pass some values directly in registers. Indeed, often the compiler can keep local variables in registers, rather than loading and storing them from and to memory. But in many cases, it really needs the stack in memory. How is that stack implemented? When you write a C program with a stack as a data structure, you allocate an array and declare a variable to be a pointer to the top of the stack. Then you can push and pop. At the machine level, the memory of the actual machine is the storage and the top of stack is kept in a register. One of the MIPS registers, $sp (or $29) is dedicated by convention to the management of a stack that functions can use to save information (e.g. contents of $s registers).

1. Again by convention, $sp always points to the top of the stack, i.e. the smallest occupied address.
2. At the start of a function comes prologue code: $sp is decremented by the total amount of space for information to be saved, and then registers whose contents are to be preserved are stored into the stack.
3. At the end of the function is the epilogue code. We load the values saved on the stack into the corresponding registers, then increment $sp and return.

Here are the prologue and epilogue code for the dist function.

Prologue:

dist:   addi $sp,$sp,-16
        sw $s0,0($sp)
        sw $s1,4($sp)
        sw $s2,8($sp)
        sw $s3,12($sp)
        ...

Epilogue:

        ...
        add $ra,$s2,$0
        lw $s3,12($sp)
        lw $s2,8($sp)
        lw $s1,4($sp)
        lw $s0,0($sp)
        addi $sp,$sp,16
        jr $ra

In this case, we have to save $ra into $s2 somewhere in the body of the function. We need to move the return address back into $ra before we restore $s2. Which way is the stack growing in this example?

11.2.11 (Brainstorm) Better stack management

Below is the complete code for the dist function. It is possible to rearrange the code slightly to save a few instructions, even while retaining the calls to the diff function. Describe how you would modify the code (other than by changing jal diff to not be a function call) to reduce the number of instructions executed while still adhering to conventions for register management.

dist:   addi $sp,$sp,-16    # prologue
        sw $s0,0($sp)
        sw $s1,4($sp)
        sw $s2,8($sp)
        sw $s3,12($sp)
        move $s2,$ra        # save return address

        move $s0,$a1        # save y0
        move $s1,$a3        # save y1
        move $a1,$a2        # set up args for call
        jal diff
        move $s3,$v0        # save abs(x1-x0)
        move $a0,$s0        # set up y0
        move $a1,$s1        # and y1
        jal diff
        add $v0,$v0,$s3     # abs(y1-y0) + abs(x1-x0)

        move $ra,$s2        # epilogue
        lw $s3,12($sp)
        lw $s2,8($sp)
        lw $s1,4($sp)
        lw $s0,0($sp)
        addi $sp,$sp,16
        jr $ra

11.2.12 (Display page) Odds and ends

Note that a call to a recursive function is handled just the same as a call to a nonrecursive function. Both do the same things to the stack. Since only four registers are dedicated to arguments, a function with more than four arguments must provide some other way to send the extra arguments to the callee. The stack is used for this purpose. The extra arguments get stored at the top (the smallest address) of the caller's stack space. That way, the callee can easily access them, since they're just beyond the callee's stack space. You'll see a bit more on this in the next project.

11.2.13 (Display page) Running your dist program in MARS

To run your code under MARS, you will want a little wrapper. The code at the top makes the call to main and issues the terminating syscall upon return. The main program sets up some sample values and calls the dist function. You'll want to use a similar wrapper in your upcoming assignments. Step through the code below. Before you take the first step, notice the value of the $sp register. You can select the stack segment in the MARS data viewer. Step through each of the calls and observe how the registers are used, how the stack is used, and how the flow of control takes place. Here we've moved the final result of dist into $t0 just so we can see it when we are done.

# default code start to call main and terminate
        jal main
        li $v0, 10
        syscall

# int diff(x,y) returns abs(y-x)
#
diff:   sub $v0,$a1,$a0
        bgtz $v0, pos
        sub $v0, $zero,$v0
pos:    jr $ra

# int dist(x0, y0, x1, y1) returns abs(x1-x0) + abs(y1-y0)
#
dist:   addi $sp,$sp,-16    # prologue
        sw $s0,0($sp)
        sw $s1,4($sp)
        sw $s2,8($sp)
        sw $s3,12($sp)
        move $s2,$ra        # save return address

        move $s0,$a1        # save y0
        move $s1,$a3        # save y1
        move $a1,$a2        # set up args for call

        jal diff
        move $s3,$v0        # save abs(x1-x0)
        move $a0,$s0        # set up y0
        move $a1,$s1        # and y1
        jal diff
        add $v0,$v0,$s3     # abs(y1-y0) + abs(x1-x0)

        move $ra,$s2        # epilogue
        lw $s3,12($sp)
        lw $s2,8($sp)
        lw $s1,4($sp)
        lw $s0,0($sp)
        addi $sp,$sp,16
        jr $ra

main:   add $s0,$ra,$0
        li $a0,1            # x0
        li $a1,5            # y0
        li $a2,4            # x1
        li $a3,1            # y1
        jal dist            # compute manhattan distance
        or $t0,$v0,$0
        add $ra,$s0,$0
        jr $ra

11.3 Homework and readings(5 steps) 11.3.1 (Display page) Homework activities Homework activities

There are several homework activities for the coming week.

In lab next Tuesday and Wednesday, you will be expected to show progress on project 1. You will earn 1 homework point for doing each of the following:

explaining what changes are necessary to adapt the symbol table code in either your homework 3 solution or the posted solution to the project;
explaining what changes are necessary to adapt the "get the next word" code from your homework 3 solution or the posted solution to the project;
actually implementing the changes to the symbol table code, or implementing the changes to the "get the next word" code.

By Monday evening, February 23, at 11:59pm, submit solutions to the three assignments in the following steps. Also by Monday the 23rd, submit peer evaluations for your classmates' submissions for homework assignment 4.

11.3.2 (Display page) Homework 5, exercise 1 Homework 5, exercise 1

Implement the following C code directly in MIPS and verify that it works in MARS.

void set_array (int array[], int num) {
    int i;
    for (i = 0; i < 10; i++) {
        array[i] = compare (num, i);
    }
}

int compare (int a, int b) {
    if (sub (a, b) >= 0) return 1;
    else return 0;
}

int sub (int a, int b) {
    return a - b;
}

int main() {
    int i;
    int Carray[10];
    for (i = 0; i < 10; i++)
        Carray[i] = i;
    set_array(Carray, 5);
}

Retain the structure of the existing code; i.e. don't optimize any function calls. You will need a few instructions at the top that call main and terminate after the return. These you can find in the lab examples. Your solution should adhere to MIPS conventions for subroutine linkage and register use. Be sure to handle the stack pointer appropriately. (Don't worry about $fp, the frame pointer.) The array declared in main is allocated on the stack, and i in set_array corresponds to $s0. Submit your solution to Expertiza as "C-to-MIPS translation".

11.3.3 (Display page) Homework 5, exercise 2

Write the following vectorSum function as a MIPS procedure:

struct vector {
    int x;
    int y;
};

void vectorSum (struct vector* vp, struct vector** vectors, int len) {
    ...
}

The function vectorSum takes a one-dimensional array of pointers to struct vector values and calculates the sum of all the vectors. The parameter len is the number of elements in the array. The sum of two vectors (x1,y1) and (x2,y2) is simply (x1+x2, y1+y2). The resulting vector should be stored in the struct vector pointed to by vp. You should assume that the integer x is stored at a lower memory address than y. Therefore, if the address of a struct vector is 0x50000, then the member x is located at 0x50000, and the member y is located at 0x50004. Your solution should adhere to MIPS conventions for subroutine linkage and register use. Submit your solution to Expertiza as "vector sum". 11.3.4 (Display page) Tower of Hanoi (homework 5, exercise 3) Tower of Hanoi (homework 5, exercise 3)

The Tower of Hanoi problem is well known. It's described here. It has an elegant recursive solution, which you'll implement for this homework exercise. Translate the following C program (in ~cs61cl/code/hanoi.c) to MIPS assembly language in the file hanoi.s. The program assumes there are pegs labeled A, B, and C. It prints moves of the form x y, where x and y are pegs. For example, the move AB means move the top disk from the peg labeled A to the peg labeled B. Adhere to previously covered conventions for managing registers and the stack. Translate each call to putchar into a call to syscall; see the MARS online help for further information. Note that syscall, like any other procedure, may trash some registers.

void main ( ) {
    Hanoi (5, 'A', 'B', 'C');
}

void Hanoi (int num, char from, char to, char other) {
    if (num == 0) {
        return;
    }
    Hanoi (num-1, from, other, to);
    putchar (from);
    putchar (to);
    putchar ('\n');
    Hanoi (num-1, other, to, from);
}

Submit your solution to Expertiza as "tower of hanoi".

11.3.5 (Display page) Readings

Read P&H, section 2.6 (fourth edition) or 2.5 (third edition) for next lab. 12 More on MIPS procedures; logic operators 2009-2-17 ~ 2009-2-18 (5 activities)


12.2 Project 1 checkoff(1 step) 12.2.1 (Display page) Checkoff requirements Checkoff requirements

In today's lab, you will be expected to show progress on project 1. You will earn 1 homework point for doing each of the following:

explaining what changes are necessary to adapt the symbol table code in either your homework 3 solution or the posted solution to the project;
explaining what changes are necessary to adapt the "get the next word" code from your homework 3 solution or the posted solution to the project;
actually implementing the changes to the symbol table code, or implementing the changes to the "get the next word" code.

TAs will also want to see whatever code you've produced so far. This is also a good time to get answers to questions you have about the project.

12.3 Explore more aspects of MIPS procedures.(5 steps)
12.3.1 (Brainstorm) Exchanging two variable values

Translate the following procedure to MIPS assembly language. The pointers should be passed in registers $a0 and $a1, and the temp variable should be allocated on the stack.

void swap (int *px, int *py) {
    int temp;
    temp = *px;
    *px = *py;
    *py = temp;
}

Make sure you compare your solution to at least one other before continuing. Here's the MARS driver from the last lab if you want to try it out.

# default start to call main and terminate
_main:  jal main
        li $v0, 10
        syscall

# Your new stuff goes here ...

main:   add $s0, $ra, $0    # save return address
        # set up args
        jal swap
        # do stuff with result
        add $ra, $s0, $0    # restore return address
        jr $ra

12.3.2 (Brainstorm) A buggy swap procedure

Now modify your solution to the previous step to implement the following buggy version of the swap procedure. As in the previous step, assume that the pointers are passed in registers $a0 and $a1, and the temp variable is allocated on the stack.

void swap (int *px, int *py) {
    int *temp;
    *temp = *px;
    *px = *py;
    *py = *temp;
}

Students have noted that a C version of this code sometimes "works". What does the MIPS version do in MARS?

12.3.3 (Brainstorm) An elusive bug

Explain what the bug in the buggy version of swap is, and why its programmer might believe that the code works even after testing it.

12.3.4 (Brainstorm) Exposing the bug

Supply the definition of a C procedure proc, to be called in the main program immediately prior to the call to the buggy swap, that will guarantee that swap will crash when the uninitialized temp pointer is dereferenced. Also explain why your call guarantees this crash. Hint: your proc procedure will leave something on the stack.

Write a MIPS procedure to compute the nth Fibonacci number F(n) where:

F(n) = 0                  if n = 0
F(n) = 1                  if n = 1
F(n) = F(n-1) + F(n-2)    otherwise

Your solution should retain the recursive structure. Don't optimize the code to remove the recursive calls.

int fib (int n) {
    if (n == 0) {
        return 0;
    } else if (n == 1) {
        return 1;
    } else {
        return fib(n-1) + fib(n-2);
    }
}

Test your solution in MARS.

12.4 Work with logic operators.(11 steps)
12.4.1 (Display page) MIPS bitwise instructions

Up until now, the MIPS instructions we've worked with all view the contents of a register as a single quantity (e.g. a character, a signed integer, or an address). The upcoming activities take a different perspective: they view a register as containing 32 individual bits rather than as a single value. They will use two new classes of instructions: logical and shift operations. There are two basic logical operators.

and: outputs 1 if both operands are 1; outputs 0 otherwise.
or: outputs 1 if at least one operand is 1; outputs 0 if both operands are 0.

We sometimes use a truth table to represent a logical expression; it lists all possible combinations of operand values and the resulting output for each. The truth tables for and and or are below.

A  B  A and B  A or B
0  0     0       0
0  1     0       1
1  0     0       1
1  1     1       1

The corresponding MIPS instructions are and, or, andi, and ori. Each takes three arguments:

1. the register in which the result will be stored;
2. the first operand—a register;
3. the second operand—a register for and and or, a numeric constant for andi and ori.

These instructions are all bitwise, meaning that bit 0 of the output is produced by the respective bit 0's of the operands, bit 1 by the bit 1's, etc. They correspond to the C operators & and |. Incidentally, C's bitwise operators are described in section 2.9 of K&R and explained in this online tutorial.

12.4.2 (Brainstorm) Logical vs. bitwise "and"

Are there operands a and b for which the program segments below will produce different output? Briefly explain your answer.

if (a & b) {                 if (a && b) {
    printf ("true\n");           printf ("true\n");
} else {                     } else {
    printf ("false\n");          printf ("false\n");
}                            }

12.4.3 (Brainstorm) Logical vs. bitwise "or"

Are there operands a and b for which the program segments below will produce different output? Briefly explain your answer.

if (a | b) {                 if (a || b) {
    printf ("true\n");           printf ("true\n");
} else {                     } else {
    printf ("false\n");          printf ("false\n");
}                            }

12.4.4 (Display page) Bit masking

"Anding" a bit with 0 results in an output of 0, while anding a bit with 1 yields the original bit. This property can be used to create a mask, a sequence of bits used to selectively screen out or let through certain bits in a data value. Here's an example.

mask that retains the last 12 bits:
    0000 0000 0000 0000 0000 1111 1111 1111

data value:
    1011 0110 1010 0100 0011 1101 1001 1010

the result of anding them:
    0000 0000 0000 0000 0000 1101 1001 1010

Thus the and operator can be used to set certain portions of a bit string to 0's, while leaving the rest alone. In particular, if the data value in the above example were in $t0, then the masking operation would be implemented as

andi $t0,$t0,0xFFF

Unlike the addi instruction, whose immediate operand is sign-extended prior to computing the sum, the immediate operands of andi and ori are not sign-extended.

12.4.5 (Brainstorm) A mask for the upper bits in a register

Give a sequence of instructions that uses and to set the rightmost two bits of register $s0 to 0. Start by putting an appropriately-valued mask in register $t0.

12.4.6 (Display page) Turning bits on Turning bits on

"Oring" a bit with 1 results in an output of 1, while oring a bit with 0 yields the original bit. This property can be used to force certain bits of a bit string to 1. For example, if $t0 contains 0x12345678, then after executing the instruction

ori $t0,$t0,0xFFFF

$t0 contains 0x1234FFFF. 12.4.7 (Display page) Shift instructions Shift instructions

MIPS shift instructions move (shift) all the bits in a word to the left or right by a constant number of bits. For example, shifting the bit pattern

0001 0010 0011 0100 0101 0110 0111 1000

left by 8 bits produces

0011 0100 0101 0110 0111 1000 0000 0000

0's are shifted in at the right. There are three shift instructions.

sll ("Shift Left Logical") shifts left and fills emptied bits with 0's.
srl ("Shift Right Logical") shifts right and fills emptied bits with 0's.
sra ("Shift Right Arithmetic") shifts right and fills emptied bits with the sign bit of the shifted value.

Examples appear below (the original page showed the shifted bit patterns as figures):

Shift right arithmetic by 8 bits, sign bit = 0
Shift right arithmetic by 8 bits, sign bit = 1
Shift right logical by 8 bits, sign bit = 1

These operators correspond to the C operators << and >>. Interestingly, the C standard does not specify what replaces the emptied bits in a right shift of a signed quantity. Java corrects this ambiguity by providing two right shift operators: >> does an arithmetic right shift (filling with the sign bit), and >>> does a logical right shift. Shifting is likely to be faster than multiplication. A good compiler will notice C code that multiplies or divides by a power of 2 and will compile it to a shift instruction. (The next step will ask you to explain why.) Examples:

C statement    generated code
k = k * 8;     sll $t0,$t0,3
k = k / 16;    sra $t0,$t0,4

12.4.8 (Brainstorm) No sla instruction?

Why does MIPS not provide a sla ("Shift Left Arithmetic") instruction?

12.4.9 (Display page) xor

Two other MIPS instructions are xor ("eXclusive OR") and xori. Their syntax is the same as that of and and or. As with and and or, an immediate operand is not sign-extended. (The C exclusive-or operator is "^", the caret.) The xor operation outputs 1 if exactly one of its operands is 1, and outputs 0 otherwise. Here's the truth table.

A  B  A xor B
0  0     0
0  1     1
1  0     1
1  1     0


The xor instruction is used with a mask to invert selected bits. Here's an example.

mask that inverts the last 12 bits:
    0000 0000 0000 0000 0000 1111 1111 1111

data value:
    1011 0110 1010 0100 0011 1101 1001 1010

the result of xoring them:
    1011 0110 1010 0100 0011 0010 0110 0101

Tricks with xor are part of the CS low-level programming culture. Here's one: exchange the values of two variables a and b without using a temporary variable. The solution is

a = a xor b;
b = a xor b;
a = a xor b;

Try it with a = 0101, b = 1100. (We promise that this won't appear on an exam.)

12.4.10 (Display page) Finding the leftmost 1

Write and test code to store in $v0 the position of the leftmost 1 bit in the word in register $a0. If $a0 contains 0, store -1 in $v0. You are allowed to modify $a0 in the process of finding this position. Positions range from 0 (the rightmost bit) to 31 (the sign bit). You should provide two solutions: one that repeatedly shifts $a0, checking the sign bit at each shift, and another that starts a mask at 0x80000000 and repeatedly shifts it right to check each bit in $a0.

12.4.11 (Brainstorm) Shift left = multiply?

Explain why shifting a nonnegative integer value left 1 bit position doubles the value. (It works for negative integers too; we'll see that next week.)

12.5 Readings(1 step)
12.5.1 (Display page) Upcoming coverage

We'll be covering machine instruction format in next week's lab, sections 2.5 and 2.10 in P&H fourth edition or sections 2.4 and 2.9 in the third edition. There will also be a couple of activities involving interesting programming techniques. 13 Table-driven programming; MIPS machine instructions 2009-2-19 ~ 2009-2-20 (4 activities) 13.1 Effective address calculations(3 steps) 13.1.1 (Display page) C data structures in MIPS C data structures in MIPS

In earlier labs we learned the basic tools provided in C to create data structures:

Array — a homogeneous collection of contiguous objects that can be referenced relative to the whole by indexing.
Struct — a grouping of a collection of objects, often heterogeneous, into a larger object such that the elements are referenced by name.
Pointer — a reference to an object, which can be accessed by dereferencing.

At the assembly and machine level, the only concepts that are present are arithmetic, loads, and stores. All paths through a data structure boil down to these basic machine operations. Let's start with a simple example.

int a[] = {5, 7, -1};

int main() {
    int sum = 0;
    int *p = a;
    int i;
    for (i = 0; i < 3; i++) {
        sum = sum + *p;
        p++;
    }
}

The MIPS code might look like this.

        .data
a:      .word 5             # int a[] = {5, 7, -1}
        .word 7
        .word -1

        .text
_main:  la $t0, a           # address of array
        li $t1, 3           # length of array
        add $a0, $0, $0     # running sum
        li $t2, 0           # index
loop:   bge $t2, $t1, done  # while (index < length)
        lw $t3, 0($t0)
        add $a0, $a0, $t3   # sum += *a
        addi $t0, $t0, 4    # a++
        addi $t2, $t2, 1    # index++
        j loop

done:   li $v0, 1           # print integer in $a0
        syscall

        li $v0, 10          # terminate
        syscall

(You can cut and paste this example and try it out in MARS.) Observe that the temporary register $t0 holds the pointer. The lw instruction performs the *p. The address is incremented by the size of the element, in this case 4. What would the code look like if it implemented

    sum = sum + a[index];

more directly? What would it look like if the data structure were an array of structs? An array of pointers to ints? An array of pointers to structs?

13.1.2 (Display page) Types of addressing

The MIPS architecture provides several addressing modes, that is, ways of accessing a value.

Immediate addressing accesses a value stored "immediately" in the machine instruction itself.
Absolute addressing uses part of an instruction to directly access memory.
Register direct addressing takes a register index from an instruction, then retrieves the contents of the indexed register.
Register indirect addressing adds a step to register-direct addressing: it takes a register index from an instruction, retrieves an address (a pointer) from that register, and then retrieves the contents of memory at that address. The address retrieved from the register is sometimes called a base address.
Base plus offset addressing adds an offset to the base address, then retrieves the contents of memory at that address.

The next step provides some self-check examples of these modes. 13.1.3 (Self Test) Data structures and effective address computations The following code implements the summation of an array of ints using an index off the base of the array.

        .data
a:      .word 5
        .word 9
        .word -1

        .text
_main:  la $t0, a           # address of array
        li $t1, 3           # length of array
        add $a0, $0, $0     # running sum
        li $t2, 0           # index
loop:   bge $t2, $t1, done  # while (index < length)
        sll $t3, $t2, 2     # convert index to byte offset
        add $t3, $t0, $t3   # compute &a[index]
        lw $t3, 0($t3)      #
        add $a0, $a0, $t3   # sum += a[index]
        addi $t2, $t2, 1    # index++
        j loop
done:   li $v0, 1           # print int $a0
        syscall

        li $v0, 10          # terminate
        syscall

What form of addressing is used in the instruction

la $t0, a

What form of addressing is used in the instruction

li $t1, 3

What form of addressing is used in the instruction

add $a0, $0, $0

What form of addressing is used in the instruction

lw $t3, 0($t3)

What form of addressing is used for the source in the instruction

lw $t3, 132($0)

What form of addressing is used for the source in the instruction

lw $t3, 132($t0)

The following modifies our example to sum the integer elements of an array of structs.

        .data
a:      # struct {char letter; int val}[] = {{'a',5},{'b',7},{'c',-1}};
        .ascii "a"
        .align 2
        .word 5
        .ascii "b"
        .align 2
        .word 7
        .ascii "c"
        .align 2
        .word -1

        .text
_main:  la $t0, a           # address of array
        li $t1, 3           # length of array
        add $a0, $0, $0     # running sum
        li $t2, 0           # index
loop:   bge $t2, $t1, done  # while (index < length)
        # FILL IN THIS INSTRUCTION
        add $a0, $a0, $t3   # sum += a->val
        addi $t0, $t0, 8    # a++
        addi $t2, $t2, 1    # index++
        j loop
done:   li $v0, 1           # print integer in $a0
        syscall

        li $v0, 10          # terminate
        syscall

What is the missing instruction?

The following modifies our example to be an array of pointers to ints. Note that we need to allocate space for the array of pointers and for the ints. We can even set it all up statically. Again you need to fill in the missing instruction. You may cut and paste the example to a file so that you can run it under MARS.

        .data
v1:     .word 5
v2:     .word -1
v3:     .word 7
a:      .word v1            # int *a[] = {&v1, &v3, &v2};
        .word v3
        .word v2

        .text
_main:  la $t0, a           # address of array
        li $t1, 3           # length of array
        add $a0, $0, $0     # running sum
        li $t2, 0           # index
loop:   bge $t2, $t1, done  # while (index < length)
        lw $t3, 0($t0)      #
        # FILL IN THIS INSTRUCTION
        add $a0, $a0, $t3   # sum += **a
        addi $t0, $t0, 4    # a++
        addi $t2, $t2, 1    # index++
        j loop
done:   li $v0, 1           # print integer in $a0
        syscall

        li $v0, 10          # terminate
        syscall

What's the missing instruction?

The following modifies it further to be an array of pointers to structs containing a field named val.

        .data
v1:     .ascii "a"
        .align 2
        .word 5
v2:     .ascii "c"
        .align 2
        .word -1
v3:     .ascii "b"
        .align 2
        .word 7

a:      .word v1            # array of pointers {&v1, &v3, &v2}
        .word v3
        .word v2

        .text
_main:  la $t0, a           # address of array
        li $t1, 3           # length of array
        add $a0, $0, $0     # running sum
        li $t2, 0           # index
loop:   bge $t2, $t1, done  # while (index < length)
        lw $t3, 0($t0)      #
        # FILL IN THIS INSTRUCTION
        add $a0, $a0, $t3   # sum += **a
        addi $t0, $t0, 4    # a++
        addi $t2, $t2, 1    # index++
        j loop
done:   li $v0, 1           # print integer in $a0
        syscall

        li $v0, 10          # terminate
        syscall

What's the missing instruction?

13.2 Practice table-driven programming.(2 steps) 13.2.1 (Display page) The scan operation The scan operation

An instruction on the IBM 360, which we'll call scan, worked as follows. Given the address of the start of a string and the address of a 256-byte table of values, it essentially implemented the following C code:

char *sptr;
char table[256];

for (sptr = address of start of the string; table[*sptr] == 0; sptr++) {
}

That is, it scans the string looking for the first character whose corresponding table entry is nonzero. The file ~cs61cl/code/scan.s contains the framework of a program to skip past white space in a string; you are to complete this program. Test your solution code with the following input lines:

an empty input line;
a line containing only blanks and tabs;
lines that start with each of the following characters: / (slash), 0..9, : (colon), @ (at sign), A..Z, [ (left bracket), ` (backquote), a..z, and { (left brace).

Part a

At the location in the program labeled by table, provide a table of bytes with the following contents.

Entries corresponding to the blank, tab, carriage return, and line feed characters should be 0.
The entry corresponding to the null character should be 1.
Entries corresponding to letters (either upper-case or lower-case), digits, and the underscore character should be 2.
All other entries should be 3.

ASCII tables are available online. The .byte directive will be useful. For example, to initialize the next five bytes to zero, you might say

.byte 0,0,0,0,0

Part b

Supply the code that implements the scan instruction. Your code should be given the string address in register $s1 and the table address in register $s2; it should branch to the label scan_done when it finds a character in the string whose corresponding table value is nonzero, leaving the address of that character in $s1 and the corresponding table value in $v0. 13.2.2 (Display page) A jump table A jump table

A jump table applies table-driven programming to instructions. It consists essentially of an array of addresses. The appropriate address from the table is computed, and then jumped to. Fill in the missing code in the following program, which implements the initial step in the casino game of Craps. (The rules are as follows: if you roll a 2, 3, or 12, you lose; if you roll a 7 or 11, you win; if you roll anything else, the game continues.) The code is online in ~cs61cl/code/craps.s.

main:   li $v0,5            # read the roll value
        syscall
        move $t0,$v0

        # Use the roll value to access one of the addresses in the table.
        # Put that address into $t1, then jr $t1.
        # Your code to do that goes here.

        .data
jumptable:
        .word lose          # 2
        .word lose          # 3
        .word trypoint      # 4
        .word trypoint      # 5
        .word trypoint      # 6
        .word win           # 7
        .word trypoint      # 8
        .word trypoint      # 9
        .word trypoint      # 10
        .word win           # 11
        .word lose          # 12

        .text
lose:   li $v0,1
        move $a0,$t0
        syscall
        li $v0,4
        la $a0,lossstring
        syscall
        li $v0,10
        syscall
win:    li $v0,1
        move $a0,$t0
        syscall
        li $v0,4
        la $a0,winstring
        syscall
        li $v0,10
        syscall
trypoint:
        li $v0,1
        move $a0,$t0
        syscall
        li $v0,4
        la $a0,trystring
        syscall
        li $v0,10
        syscall

        .data
lossstring: .asciiz ": loses\n"
winstring:  .asciiz ": wins\n"
trystring:  .asciiz ": roll for point\n"

13.3 Explore MIPS machine instructions and assembly language pseudoinstructions.(20 steps)
13.3.1 (Display page) Descending to another abstraction level

We have seen how builtin Java operations for storage allocation are implemented in C, and also how most C operations are implemented in assembly language. We're now ready to move to the next lower level of abstraction, namely, machine language. The powerful idea underlying this level is the stored program concept: that machine instructions are represented as bit patterns in memory just as data are. Since all instructions and data are stored in memory, everything—instructions and data words—has a memory address. For example, C pointers can point to instructions as well as numeric data. A specific hardware register (inaccessible to users) named the program counter keeps track of the address of the instruction being executed. (Intel refers to this as the "instruction address pointer", a much better name.) The CPU executes a program as follows:

1. Load the instruction addressed by the program counter from memory.
2. Increment the program counter.
3. Determine what the instruction is, and what its operands are.
4. Execute the instruction. (This may change the program counter.)
5. Repeat.

All the data we've worked with has been words (or pieces of words, e.g. bytes). Each register is a word. The lw and sw instructions both access memory one word at a time. It makes sense to have instructions be words as well. A MIPS machine instruction will therefore consist of 32 bits, split into fields (bit sequences) that provide information about the type of instruction and its operands. Two factors governing the design of the MIPS architecture are simplicity and uniformity:

the number of different instruction formats should be minimized, and the positions of bit fields in instructions of different formats should be as consistent as possible.

13.3.2 (Brainstorm) Instruction groups

Here are some instructions. Categorize them into three groups so that all the instructions in each group have a similar format:

j  sub  ori  jal  addi  slt  add  or

13.3.3 (Self Test) Identifying registers

Some bit fields in MIPS instructions will be used to identify register operands. How many bits are necessary to identify a particular MIPS register?

13.3.4 (Self Test) Identifying instructions There are 74 MIPS instructions. How many bits are necessary to distinguish a particular instruction?

13.3.5 (Display page) Instruction formats Instruction formats


The MIPS architecture provides three basic instruction formats:

R-format instructions, all having three register operands; J-format instructions, namely j and jal; and I-format instructions, all having an "immediate" operand (a constant stored in the instruction itself).

The leftmost bits of each instruction form the op code, which helps to distinguish the instruction from all the others. The next two fields in R format and I format specify registers, using five bits each. In R format, there is a third register operand. In J format, everything but the op code is used to specify the jump destination address. The three formats are diagrammed below.

One may notice from the diagram that the information encoded in an I-format instruction or a J-format instruction will fill the 32 bits. Whatever bits aren't taken up by the op code and the registers will be used to maximize the size of the immediate operand; similarly, the bits not occupied by the op code for a jump instruction will be used to maximize the size of the destination address. Thus we would like the op code to take up as few bits as possible. We noted earlier that there are 74 MIPS instructions. It seems that we need at least 7 bits to distinguish among them. However, the R-format instructions provide an opportunity for economizing on bits for the op code. Most R-format instructions need bits only to specify the operation and the three register operands; this leaves unused space in the instruction. The MIPS instruction format takes advantage of this space by assigning all the R-format instructions an op code of 0, then using a bit field elsewhere in the instruction to distinguish among them. There are 28 R-format instructions. Assigning them all an op code of 0 leaves 46 other op code values, which can be specified in 6 bits. The diagram below specifies the exact bit fields used for each type of instruction. The field labeled "function" is used to distinguish among all the instructions with op code 0.

Consistent with P&H's style, we write each field as a 5- or 6-bit unsigned integer, not as part of a 32-bit integer. 13.3.6 (Display page) More details about R-format instructions More details about R-format instructions

The figure below provides more detail about R-format instructions.

The op code is 0, as noted previously. The funct field is then used to identify the instruction. There are three register fields:

rs ("Source Register"), generally used to specify the register that contains the first operand. rt ("Target Register"), generally used to specify the register that contains the second operand (note the misleading name). rd ("Destination Register"), generally used to specify the register that will receive the result of the computation.

Note that the order of registers in an assembly language R-format instruction differs from their order in the corresponding machine instruction. Also, we say "generally" above since there are a few exceptions. For example, mult and div have nothing important in the rd field since they use special registers named hi and lo as destinations. The shamt field is used for shift instructions; it's 0 for a non-shift instruction. It contains the amount to shift by; since it makes no sense to shift a register by more than 31 bits, this is a 5-bit field. We now know enough to decode an R-format machine instruction into assembly language, and to translate an assembly language instruction to machine code. (The green card on the inside front cover of P&H is a big help here.) Here is an example. Translate add $8,$9,$10 to machine language.

opcode is 0 (look that up in the table in P&H); funct is 32 decimal (look that up too); rd is 8 (destination); rs is 9 (first operand);

rt is 10 (second operand); shamt is 0 (not a shift).

Substituting into the bit fields, we get the following.

In decimal:

In binary:

We find the hexadecimal representation by splitting the bits into 4-bit groups, each of which is a hexadecimal digit; the result is 0x012A4020. The decimal value, in case you're interested, is 19,546,144. 13.3.7 (Brainstorm) Translation to machine language (1) Translate the instruction

sll $10,$12,13 to the corresponding machine instruction. Describe how you did it. 13.3.8 (Brainstorm) Translation to assembly language (1) Translate the machine instruction 0x02E4B023 to the corresponding assembly language instruction. Describe the steps you used to do the translation. 13.3.9 (Display page) J-format instructions J-format instructions

The diagram below shows the bit fields of the J-format instructions.

Ideally, we would be able to specify any 32-bit address to jump to. We unfortunately can't fit both a 6-bit opcode and a 32-bit address into a single word. How close can we get? We start by taking advantage of the fact that all instructions are one word long, and are aligned on word boundaries. That means that the address of each instruction is divisible by 4, and in particular, the last two bits of the address of an instruction are 0. If we know they are 0, we don't need to store them in the target address. We now have specified 28 bits of a 32-bit address (the 26 bits in the target address field and the two 0 bits at the end). To complete the 32-bit address, we use the top 4 bits of the program counter. We can't use a jump instruction to get to anywhere in memory. However, the procedure just described handles almost all the cases since most programs aren't that long. If we absolutely need to specify a 32-bit address whose top 4 bits don't match those of the program counter, we can merely load the address into a register and use the jr instruction. 13.3.10 (Brainstorm) Jump distance? Suppose that the instruction at address 0x40000004 is 0x08000006 (a j instruction), and that instruction is executed. How far away in bytes is the jumped-to instruction? 13.3.11 (Brainstorm) Maximal jump distance? What's the maximal distance in bytes that one can jump using a J-format instruction? Briefly explain your answer. 13.3.12 (Display page) I-format instructions I-format instructions

Finally, there are the I-format instructions. Their format appears below.

opcode is the same as in the other instruction formats. Since there is no funct field, the op code uniquely specifies an I-format instruction. rs is as in R-format instructions, generally specifying the register that contains the first operand (if one exists). rt specifies the register into which the result of the computation will be stored (that's why it's referred to as the "target" register), or a second operand for some instructions.

Note that only one field is inconsistent with R-format; in particular, the op code and the two register operands are in the same place in both formats. This simplifies the instruction decoding hardware. Some obvious examples of I-format instructions are addi and ori. Loads, stores, and branches also are I-format instructions. 13.3.13 (Brainstorm) Translation to assembly language (2) What assembly language instruction corresponds to 0x21c90005? Briefly describe how you determined it. 13.3.14 (Display page) Forming a 32-bit value from a 16-bit immediate field Forming a 32-bit value from a 16-bit immediate field

In general, we will need to produce a 32-bit value from a 16-bit field. For example, we would want the operand in the instruction addi $10,$9,-1 to be the word containing -1. One way to do this is sign extension; for example, in the addi, slti, lw, and sw instructions, the immediate field is treated as a signed integer and sign-extended. This is good enough to handle the offset in a typical lw or sw, plus a vast majority of the values that are used in an slti instruction. This doesn't handle every case, however. To produce bigger operands, MIPS provides another instruction named lui ("Load Upper Immediate"), which takes a 16-bit immediate value and puts it into the upper half—the high-order half—of the specified register. It sets the low-order half of the register to 0. Here's how lui is used to produce a li of a large constant.

li $t0,0xABABCDCD becomes

lui $t0,0xABAB ori $t0,$t0,0xCDCD 13.3.15 (Display page) Pseudoinstructions Pseudoinstructions

The steps we just observed that perform a 32-bit "load immediate" are actually taken care of by the assembler in a pseudoinstruction. This is an instruction that's not implemented in the hardware, but which the assembler recognizes and translates into instructions that are. Some pseudoinstructions are provided as synonyms for common machine instructions; for example, the pseudoinstruction

move destReg, sourceReg is translated to

add destReg,sourceReg,$0 Others are translated into sequences of machine instructions, generally to handle the problem of immediate operands that are too big to fit in an instruction. Here are two more examples:

la $a1,label is translated to
    lui $at,top half of label address
    ori $a1,$at,bottom half of label address

lw $s0,label is translated to
    lui $at,top half of label address
    lw $s0,bottom half of label address($at)

Note that the lw in the second translation is not a pseudoinstruction, but a regular I-format lw. Sometimes the assembler needs a temporary register to be able to implement a pseudoinstruction, for instance, in the la and lw pseudoinstructions above. The register $at, for "Assembler Temporary", is conventionally reserved for this purpose. (MARS doesn't enforce this convention.) 13.3.16 (Brainstorm) Distinguishing pseudoinstructions from real instructions Suppose you are reading an assembly language program, and come upon a lw instruction. How can you tell whether or not the lw is a pseudoinstruction? Briefly explain your answer. 13.3.17 (Display page) Branches Branches

The instructions beq ("Branch if EQual") and bne ("Branch if Not Equal") are also I-format instructions. (Their counterparts, bgt, bge, blt, and ble are pseudoinstructions that are assembled into combinations of slt and beq or bne.) The format of beq and bne is shown below.

opcode distinguishes between beq and bne. rs and rt specify the registers to compare.

A question: What can immediate specify? It's only 16 bits. The program counter, which stores the address of the next instruction to execute, is a 32-bit pointer to memory. Thus the immediate field cannot store a complete address to branch to. The typical uses of a branch are to implement if ... then ... else, for, and while. Loops are generally small, as are if ... then ... else statements. In contrast, far-away jumps for, say, subroutine calls are handled by j and jal. That suggests that a more bit-economical method of handling a branch is to store a relative value—the difference between the branch target and the program counter—instead of an absolute address. Here's an example:

loop: beq $9,$0,end    # i.e. jump ahead 4 instructions
      add $8,$8,$10
      addi $9,$9,1
      j loop
end:  ...

The immediate field is thus a signed two's complement integer that provides a PC-relative address. As with J-format instructions, we know that the branch offset in bytes is divisible by 4; therefore we don't need to store its two low-order 0 bits. A good interpretation of the immediate field, then, is that it specifies how many instructions to jump back or ahead. beq and bne are executed as follows.

1. The appropriate comparison is done. 2. If it fails, we don't branch; PC = PC+4, the byte address of the next instruction. 3. If it succeeds, PC = (PC+4) + immediate * 4. (immediate is added to PC+4, not PC, because the program counter has already been incremented earlier in the instruction decoding phase. We'll revisit this later in the course.)

Returning to our example, we'll translate the beq $9,$0,end into a machine instruction. We look up the opcode; it's 4. rs is 9; rt is 0. immediate is the number of instructions to add to or subtract from the PC, starting at the instruction following the branch. Here, immediate is 3. Here's the decimal version: the fields are 4, 9, 0, and 3. And now the binary: 000100 01001 00000 0000000000000011.

In hexadecimal, we have 0x11200003. 13.3.18 (Brainstorm) Translation to assembly language (3) Translate the beq in the code below to a machine instruction. Briefly describe how you did it.

loop: addi $8,$8,10
      addi $9,$9,-1
      beq $9,$0,loop

13.3.19 (Brainstorm) Effect of code movement Suppose the three-instruction loop in the previous step were moved elsewhere in the program. Would its machine language representation change? Why or why not? 13.3.20 (Display page) Summary Summary

We have now descended to the level of machine instructions, and have found that there are three types of instructions:

R-format (three register operands)
I-format (two register operands and an immediate operand)
J-format (one address operand)

All these instructions are 32 bits wide. All have the op code in the same place, and R-format and I-format instructions have register operands in the same place. Instructions in a program are word-aligned, which makes possible the saving of two bits in branch and jump instructions. Translating an assembly language instruction involves first recognizing it either as a pseudoinstruction or a native MIPS instruction. Pseudoinstructions translate into sequences of MIPS instructions, often to handle large immediate operands. For native MIPS instructions, we look up values for the op code (and perhaps the funct field) in P&H. For register operands, we merely plug in their values, being careful to keep the operand order correct. Finally, we translate to hexadecimal. In studying jump and branch instructions, we encountered two kinds of addressing: absolute addressing, used in jumps, and PC-relative addressing, used in branches. Absolute addressing is the most flexible. PC-relative addressing arises from the need to specify a target address in only 16 bits. 13.4 Homework and reading(1 step) 13.4.1 (Display page) Upcoming activities Upcoming activities

Homework assignment 5 and peer evaluations of homework assignment 4 are due next Monday evening, February 23. Due to the upcoming exam (February 25) and project 1, the peer evaluations of homework assignment 5 will be the only homework due next Friday. Project 1 is due Monday evening, March 2. There will be another project checkoff in the Tuesday/Wednesday labs of next week, asking you to show off some working code. You will earn homework points for being able to assemble different instructions. We'll be covering integer number representation in the next lab, section 2.4 in P&H fourth edition and section 3.2 in the third edition. 14 Number representation 2009-2-24 ~ 2009-2-25 (5 activities) 14.2 Consider the representation of negative values.(10 steps) 14.2.1 (Display page) A table of 8-bit values A table of 8-bit values

Consider the table of 8-bit values reached through this link. The table was produced by the program below, also in ~cs61cl/code/print8bittable.c. It shows all 8-bit values, first interpreting them as unsigned, then as signed. It also lists the bitwise complement—the 1's complement—of each 8-bit value and that bit pattern's interpretation as a signed integer. In the next few steps, we'll explore properties of and relationships between entries in this table.

#include <stdio.h>

void printBinary (unsigned char);

int main ( ) {
  signed char s;
  unsigned char u;
  printf ("unsigned signed bits 1's signed\n");
  printf (" char char compl equiv\n");
  printf ("----------------------------------------\n");
  for (s=0, u=0; u<255; s++, u++) {
    printf (" %3u %4hd ", (unsigned char) u, s);
    printBinary (u);
    printf (" ");
    printBinary (~u);
    printf (" %4hd\n", ~s);
  }
  printf (" %3u %4hd ", (unsigned char) u, s);
  printBinary (u);
  printf (" ");
  printBinary (~u);
  printf (" %4hd\n", ~s);
  return 0;
}

void printBinary (unsigned char k) {
  int shift;
  for (shift=7; shift>=0; shift--) {
    putchar ('0'+((k >> shift)%2));
  }
}

14.2.2 (Self Test) Counts of negative and positive values Consider the second column of the table (the column labeled "signed char"). How many negative values are in this column?

How many positive values are in this column?

14.2.3 (Brainstorm) A characteristic of negative numbers Explain how you can tell if a given 8-bit quantity represents a negative number in the "signed char" column. 14.2.4 (Display page) Relating negative bit patterns to their positive counterparts Relating negative bit patterns to their positive counterparts

Develop a general procedure for relating a bit pattern for a value in this system to the bit pattern for its negation.

1. What value results from adding a number and its 1's complement?
2. Use this fact to determine how to get the bit pattern for the negative of a number. For instance, given the bit pattern for 5, find the bit pattern for -5.
3. See if your procedure works on the bit pattern for another positive number, say, 10.
4. Try your procedure on the bit pattern for a negative number. For example, given the bit pattern for -5, find the bit pattern for 5.

14.2.5 (Brainstorm) Deriving -n from n Describe your procedure for determining the bit pattern for –n from the bit pattern for n. 14.2.6 (Display page) Why this works Why this works

First, some history. Representing negative numbers was a big concern in the early days of computing. A natural first choice was sign magnitude representation, basically the way people represent numbers. For example, the value -6 was represented as a negative sign (chosen to be a 1) and the binary representation of 6. The table below provides examples of an 8-bit sign magnitude representation.

decimal value   sign  magnitude
  3             0     0000011
  2             0     0000010
  1             0     0000001
  0             0     0000000
 -0             1     0000000
 -1             1     0000001
 -2             1     0000010
 -3             1     0000011

Unfortunately, this approach has two big drawbacks. One is that there are two representations for zero, namely +0 and -0. This doubles the effort of comparing a value with 0. The other is that adding 1 to a positive value x, using unsigned binary arithmetic, results in x+1, but adding 1 to a negative value x results in x-1. Thus there have to be two algorithms for addition, one for adding values with the same sign and one for adding values with different signs. (The two algorithms are essentially addition and subtraction.) An alternative is 1's complement representation, in which we represent the negative of a value x merely as the bitwise negation of x (what results from inverting all the bits). This system still has the problem of two representations for zero, but at least adding a positive to a negative is more uniform except when a negative value results. As the experiments in previous steps may have suggested, we can solve both problems by sliding the negative values down one place, replacing the entry for negative zero by another negative number as shown below. This system for representing integers in binary is called 2's complement.

Negating a positive value x starts by finding its 1's complement. After the shift of the negative numbers, that yields a value that was what we wanted in 1's complement notation but is off by one in 2's complement notation. We add 1 to the result to accommodate the shift. Similarly, negating a negative value x starts with finding the 1's complement.


Because of the shift, this gets us a value that's 1 less than what we need, so we add 1 accordingly. From Patterson and Hennessy:

2's complement gets its name from the rule that the unsigned sum of an n-bit number and its negative is 2^n; hence, the complement or negation of a two's complement number x is 2^n - x.

14.2.7 (Self Test) A negative finding Consider the value that, represented in 16-bit 2's complement, appears in hexadecimal below.

0xA59F In hexadecimal, what's the negative of this value?

14.2.8 (Display page) Another way to view binary numbers Another way to view binary numbers

We can view a nonnegative number as having an infinite number of leading zeroes. What would be the corresponding property for a negative number? Let's consider -1. That should be the value that, when added to 1, produces 0. Working through the addition bit by bit from the right reveals -1 to be the value that's all 1's: each bit must be 1 to make the corresponding sum bit 0 and propagate a carry leftward.

    ...1111
  + ...0001
  ---------
    ...0000

In general, a negative number has an infinite number of leading 1's. Computers unfortunately can't store an infinite number of bits. They do the easy thing, however, namely storing the rightmost bits. The situation where this "windowing"—viewing an infinite bit sequence through a finite "window"—gives an inaccurate view is called overflow. Overflow is defined as the situation where the leftmost retained bit, namely the sign bit, is not the same as what's in the infinite sequence of leading bits. Here are two examples, using an 8-bit window.

add 1 to 127:
    ...000 01111111
  + ...000 00000001
  -----------------
    ...000 10000000

add -1 to -128:
    ...111 10000000
  + ...111 11111111
  -----------------
    ...111 01111111

In each case the sign bit in the window disagrees with the infinite sequence of leading bits, signaling overflow. The leading zeroes or ones also come into play when increasing precision, that is, increasing the number of bits available to represent a value. In this situation, the sign is extended. For example, the assignment to b32 in the code below extends the sign bit of the character through the leftmost 24 bits of the int, storing 11111111111111111111111111111011 = -5 into b32.

int b32;
signed char b8 = -5; /* now contains 11111011 */
...
b32 = b8;

14.2.9 (Display page) 32-bit 2's complement 32-bit 2's complement

The diagram below shows the 2's complement representation for 32-bit values.


14.2.10 (Display page) Sign extension Sign extension

Occasionally you find that you want to convert a signed integer stored in some bit field into the same value stored in a larger bit field. For example, the addi instruction stores the immediate operand in 16 bits of the instruction. This operand needs to be expanded to 32 bits in order to add it to the value in the addi's register operand. Suppose the immediate operand is a negative number. Filling the extra 16 bits with 0 produces a positive value, not the negative value we desire. What we do instead is to fill the extra bits with the sign of the value being extended, to preserve in the expanded bit field the sign and value of the original. Wikipedia goes into somewhat more detail about the sign extension operation. 14.3 Project 1 checkoff(1 step) 14.3.1 (Display page) Checkoff requirements Checkoff requirements

Your lab t.a. will expect to see more progress on project 1 today. A test script has been made available for your use. Run it by making a directory, putting an executable version of your assembler named proj1 in it, copying the file ~cs61cl/code/proj1.tests/test.proj1 to the directory, and then giving the command

test.proj1 The script tests your code on 29 short programs. You get 1 homework point for each program you assemble successfully, up to a maximum of 5.

There will be an exam in class from 6 to 8pm on this coming Wednesday, covering everything up through last week's labs. Now is a good time to catch up on lab activities you haven't yet completed. You should also review activities you've done so far and make sure you understood the concepts underlying them. If there are aspects of C or assembly language you don't understand, now is the time to get them clarified. If there were "brainstorm" questions for which the correct answers were not clear to you, make sure to get appropriate explanations. Alternatively, you could work on the project. 14.5 Homework and readings(1 step) 14.5.1 (Display page) Upcoming activities Upcoming activities

Project 1 and your peer evaluation of homework assignment 5 are due next Monday, March 2. Activities for the next two labs will all involve floating-point computation. Relevant readings from P&H are sections 3.5, 3.7, and 3.8 in the fourth edition and sections 3.6, 3.7, and 3.8 in the third edition. 15 Issues relating to floating-point computation 2009-2-26 ~ 2009-2-27 (3 activities) 15.1 Work with decimal floating-point values.(13 steps) 15.1.1 (Display page) Representing more numbers Earlier in CS 61CL, we discussed ways to represent integer values. We noted that with N bits, we can represent the unsigned integers 0, 1, ..., 2^N - 1 and, using 2's complement, the signed integers -2^(N-1), ..., 2^(N-1) - 1. But what if we want to represent a value with both integer and fractional parts? Decimal fractions can be represented either as a sequence of digits followed by a decimal point followed by another sequence of digits, for example,

3.14159 or in scientific notation as

m × b^e

where m is the mantissa, b is the base (10 in decimal), and e is the exponent. An example is

3.14159 × 10^0. Here 3.14159 is the mantissa, 10 is the base, and 0 is the exponent. 15.1.2 (Brainstorm) Comparing notations for fractional values What values are better or more easily expressed in scientific notation than in conventional decimal notation? (By "conventional decimal notation", we mean a sequence of digits followed by a decimal point followed by another sequence of digits.) 15.1.3 (Display page) Floating-point values, operations, and terminology Floating-point values, operations, and terminology

Values can be expressed in scientific notation in more than one way. For example, .403 × 10^2 = 4.03 × 10^1 = 40.3 × 10^0. The decimal point is said to "float" in this situation, leading to the term floating-point to describe this representation. (In contrast, the term fixed-point refers to the situation where there is a fixed number of digits before the decimal point, and a fixed number of digits after the decimal point.) Usually it's preferable to fix the position of the floating decimal point, for example to require that the mantissa is at least 1 and less than 10; with this convention, 4.03 × 10^1 is preferable to .403 × 10^2 and 40.3 × 10^0. A value expressed in this unique notation is said to be normalized. Multiplication of values represented in scientific notation is straightforward: we multiply the mantissas and add the exponents. For example:


(3.14 × 10^0) × (2.1 × 10^3) = (3.14 × 2.1) × 10^(0+3) = 6.594 × 10^3 The resulting product may need to be renormalized. Division of two floating-point values is analogous. Addition and subtraction are somewhat more complicated. 15.1.4 (Brainstorm) Adding floating-point values Give an algorithm for adding two values expressed in scientific notation, for example, 4.8 × 10^2 and 3.1 × 10^-1. 15.1.5 (Display page) Implications of finite precision Implications of finite precision

Just as 32 bits limited the range of representable integer values, the number of digits used to represent the mantissa and exponent in a floating-point representation will restrict the number of representable values. For example, suppose we limit ourselves to two unsigned decimal digits in the mantissa and to one unsigned decimal digit in the exponent:

d1 d2 × 10^d3

Then we can represent 3300 by 33 × 10^2; we can't represent 35300 since we only get two digits for the mantissa. 15.1.6 (Brainstorm) Smallest nonrepresentable integer? Identify the smallest positive integer not representable in the form

d1 d2 × 10^d3

where d1, d2, and d3 are all unsigned decimal digits. Briefly explain how you got your answer. 15.1.7 (Brainstorm) Comparison of three-digit representations (1) Consider the three representations A, B, and C that each uses three unsigned decimal digits d1, d2, and d3 to represent an integer N.

Representation A: N = d1 d2 × 10^d3
Representation B: N = d1 × 10^(d2 d3)
Representation C: N = d1 d2 × 100^d3

In which of A, B, and C can the largest value be represented, and what's the value? Briefly explain. 15.1.8 (Brainstorm) Comparison of three-digit representations (2) In which of representations A, B, and C can the most different values be represented? Choose one of a-e, and briefly defend your answer. Hint: duplicate values can arise in one or more of these representations when one of the dk is 0.

a. All can represent the same number of different values.
b. Two of the systems can represent an equal number of values, more than the third.
c. C can represent more values than the other two representations.
d. B can represent more values than the other two representations.
e. A can represent more values than the other two representations.

15.1.9 (Display page) Distribution of representable values Distribution of representable values

A couple of things to note about representable values:

Though the number of values representable in a floating-point system may not be as many as the number representable in a fixed-point system, the range of values can be much larger. The representable values are not evenly distributed: charted on a number line, smaller representable values are closer together, and larger representable values are further apart. Our representation d1 d2 × 10^d3 can represent every positive integer from 1 to 99, but the largest "adjacent" values differ in magnitude by a billion.

15.1.10 (Brainstorm) Addition error? Assume that representation A is being used, and the addition

33 × 10^0 + x

gives a mathematically incorrect answer. What could x be? 15.1.11 (Display page) Roundoff error Roundoff error

The sum of two representable values is not necessarily representable itself. One answer to the previous step, for example, would be


33 × 10^0 + 09 × 10^1 which gives the unrepresentable value

123 × 10^0 The best we can do is choose the representable value that's closest. We do this by rounding. With a two-digit mantissa, the third digit indicates how to do this approximation: if it's 0-4, we round down; for 6-9, we round up; and for 5, we have a choice. Strategies for handling the 5 case include always rounding up, always rounding down, or sometimes rounding one direction and sometimes another. The latter strategy helps decrease accumulation of roundoff error. Another approach to dealing with roundoff error is to maintain a guard digit at the end of each operand. Patterson and Hennessy explain this technique on page 214. Another source of roundoff error occurs when a large value swamps a small one. Consider, for example, the sum

33 × 10^0 + 15 × 10^8 We compute this by first denormalizing the smaller operand so that the exponents are equal, then adding:

(1500000000 + 33) × 10^0 The 33 is much too small to survive in the sum. 15.1.12 (Brainstorm) Associativity of addition? Using the sources of computation error just mentioned, one can show that addition of values represented in floating-point is not associative. That is, there are a, b, and c for which a + (b + c) is not equal to (a + b) + c. Find values for a, b, and c that verify that addition is not associative in our three-decimal-digit floating-point system, and briefly explain your answer. 15.1.13 (Display page) Summary Summary

The preceding activities introduced you to floating-point representations of numeric values. We have seen that the number of bits provided in a representation—its precision—can restrict both the range of representable values and the number of different values that we can represent. Addition and subtraction are significantly more complicated than their integer counterparts. Finite precision also means that we must cope with roundoff errors that result from arithmetic operations producing unrepresentable results. 15.2 Work with binary floating-point values.(5 steps) 15.2.1 (Display page) Floating-point representation in binary Floating-point representation in binary

The floating-point representation that we explored in the last lab is a direct translation of scientific notation:

mantissa × 10^exponent Floating-point representation in base 2 is similar to its base-10 counterpart.

mantissa × 2^exponent The mantissa may involve a "binary point", which is the boundary between the integer and the fractional parts. Binary fractions are combinations of negative powers of 2; a table appears below.

k    2^-k (decimal)       fraction   2^-k (binary)
0    1.0                  1          1.0
1    0.5                  1/2        0.1
2    0.25                 1/4        0.01
3    0.125                1/8        0.001
4    0.0625               1/16       0.0001
5    0.03125              1/32       0.00001
6    0.015625             1/64       0.000001
7    0.0078125            1/128      0.0000001
8    0.00390625           1/256      0.00000001
9    0.001953125          1/512      0.000000001
10   0.0009765625         1/1024     0.0000000001
11   0.00048828125        1/2048     0.00000000001
12   0.000244140625       1/4096     0.000000000001
13   0.0001220703125      1/8192     0.0000000000001
14   0.00006103515625     1/16384    0.00000000000001
15   0.000030517578125    1/32768    0.000000000000001

15.2.2 (Self Test) Binary-decimal conversion Represent the decimal value 3.125 in binary.

Represent the binary value 10.101 in decimal.

15.2.3 (Self Test) The fraction part Assume that you have a 16-bit word. Instead of thinking of the binary point as being to the right of the least significant bit as we do with ints, think of the binary point as being to the left of the most significant bit. What value does the following bit pattern represent?

0000000000000000


What about the following bit pattern?

1000000000000000 Express your result as a fraction x/y.

What about the following bit pattern?

1100000000000000 Express your result as a fraction x/y.

What about the following bit pattern?

0000000000000001 Express your result as a fraction x/y. Some useful powers of 2:

2^10 = 1024 2^15 = 32768 2^16 = 65536 2^17 = 131072

15.2.4 (Display page) A real-world example A Real-world example

Here's a story that illustrates the differences between using floating-point and using fixed-point representations of real numbers. The Texas Instruments MSP430 F1611 is a 16-bit microcontroller used in millions of devices. It has an internal temperature sensor that is used to monitor its environment. This chip is supposed to operate from -40° to +85° C. It has a 12-bit analog-to-digital converter (ADC) that samples the internal temperature sensor and produces a value from 0 to 4095. A reading of 0 represents a voltage across the temperature sensor of 0 volts; a full-scale reading of 4096 would correspond to a voltage of 3.0 volts. Temperature increases linearly with voltage. The MSP430 F1611 User's Guide says that the relationship between the voltage reading Vtemp and the Celsius temperature is the following:

Vtemp = 0.00355*TempC + 0.986 We want to know, given the voltage, what the corresponding temperature is. The ADC gives us a reading proportional to the voltage. A little algebra reveals that

3.0 volts / 4096 = Vtemp / ADCtemp so

Vtemp = 3.0 * ADCtemp / 4096 Plugging this into your C program, you convert the 12-bit integer ADC readings to degrees C with the expression:

TempC = ((3.0*ADCtemp/4096) - 0.986) / 0.00355 This produces fine answers in floating point. In fact, it even works on the MSP430, which has no floating-point hardware, because the compiler calls a library that emulates floating point in software. However, you discover that this takes more code than the whole rest of your application. 15.2.5 (Display page) Real-world example, part 2 Real-world example, part 2

Our problem: compute

TempC = ((3.0*ADCtemp/4096) - 0.986) / 0.00355 more quickly. The first column in the table below shows the ADC readings and the second shows the temperature that that reading represents according to the formula. In CS 61CL we learned all about fixed point, so we decide to use 16-bit integer operations to implement a close approximation to the temperature by representing a real as an 8.8 fixed-point value. The first 8 bits are the integer part of the value; the last 8 bits are the fraction. This allows us to represent values from -128 to +127, to a precision of 1/256 of a degree, far more accurate than our temperature sensor. We start by simplifying the calculation. Dividing is hard, so we replace the division by .00355 with multiplication by 281.7 (determined with a calculator). We leave the divide by 4096; combined with the 8 fraction bits we're carrying, it amounts to a shift right by four places.

TempC = (ADCtemp*3/4096 - 0.986) * 281.7 We start assigning pieces of the computation to temporary variables:

T1 = (ADCtemp*3) shifted right 4 When we multiply, we don't scale the constant. We are multiplying an 8.8 format number by a 16.0 format number to get a ?.8 format number. Now we convert the fractional quantities to 8.8 fixed-point format, first multiplying by 256 (that's a shift left by 8 bits), then rounding. We get

.986 floating point = 252 8.8 fixed point
281.7 rounds to the integer multiplier 282 (constants used in multiplication aren't scaled)
T2 = T1 - 252; # now we're using integer operations
T3 = T2 * 282;
Now we analyze how this can fail. How many bits does it take to represent T1 over the whole range? Look at column 3 in the table below. 10 bits are needed (11, counting the sign bit). The table shows the T3 result with no truncation. Do we stay within 16 bits? Well, not quite. Over the

full range it requires 18 bits. However, before we exceed the numerical range the chip either burns up or freezes, so the calculation is probably OK. The last two table columns show the conversion back to floating-point values and the resulting accuracy. Aren't you glad to have floating point so that you don't have to go through this kind of analysis?

ADC    TempC    T1    T2    T3       Temp     error
0      -277.8   0     -252  -71064   -277.6   -0.06%
128    -251.3   24    -228  -64296   -251.2   -0.07%
256    -224.9   48    -204  -57528   -224.7   -0.09%
384    -198.5   72    -180  -50760   -198.3   -0.11%
512    -172.1   96    -156  -43992   -171.8   -0.14%
640    -145.7   120   -132  -37224   -145.4   -0.18%
768    -119.3   144   -108  -30456   -119.0   -0.24%
896    -92.8    168   -84   -23688   -92.5    -0.34%
1024   -66.4    192   -60   -16920   -66.1    -0.51%
1152   -40.0    216   -36   -10152   -39.7    -0.90%
1280   -13.6    240   -12   -3384    -13.2    -2.81%
1408   12.8     264   12    3384     13.2     3.16%
1536   39.2     288   36    10152    39.7     1.09%
1664   65.6     312   60    16920    66.1     0.69%
1792   92.1     336   84    23688    92.5     0.51%
1920   118.5    360   108   30456    119.0    0.42%
2048   144.9    384   132   37224    145.4    0.36%
2176   171.3    408   156   43992    171.8    0.32%
2304   197.7    432   180   50760    198.3    0.28%
2432   224.1    456   204   57528    224.7    0.26%
2560   250.5    480   228   64296    251.2    0.24%
2688   277.0    504   252   71064    277.6    0.23%
2816   303.4    528   276   77832    304.0    0.22%
2944   329.8    552   300   84600    330.5    0.21%
3072   356.2    576   324   91368    356.9    0.20%
3200   382.6    600   348   98136    383.3    0.19%
3328   409.0    624   372   104904   409.8    0.18%
3456   435.5    648   396   111672   436.2    0.18%
3584   461.9    672   420   118440   462.7    0.17%
3712   488.3    696   444   125208   489.1    0.17%
3840   514.7    720   468   131976   515.5    0.16%
3968   541.1    744   492   138744   542.0    0.16%
4096   567.5    768   516   145512   568.4    0.16%

15.3 Homework and readings(3 steps) 15.3.1 (Display page) In the next lab ... In the next lab ...

We'll go over the actual representation of floating-point values on the MIPS computer (and, for that matter, on other architectures as well). Reminder: project 1 and peer evaluation of homework assignment 5 are due this coming Monday night. Homework assignment 6 follows in the next two steps. 15.3.2 (Display page) Analyzing different floating-point formats (hw6.1) Analyzing different floating-point formats

Solutions to this and the next problem are due Monday, March 9 at 11:59pm. Consider two representations for six-bit positive floating-point values. One—we'll call it representation B (for binary)—is a radix-2 representation. It stores an exponent in two bits, represented in two's complement. It stores the normalized significand in four bits, using the hidden bit as in the IEEE floating-point representation. Thus the value 2.5 (decimal) = 10.1 (binary) would be represented as

01 (1) 0100 and the value 7/8 (decimal) = 0.111 (binary) would be represented as


11 (1) 1100 The second representation—we'll call it representation Q (for quaternary)—is a radix-4 representation. Like representation B, it stores an exponent—of 4, not 2—in two bits, represented in two's complement. It stores the significand in four bits, two base-4 digits, with the first base-4 digit being to the left of the quaternary point and the second base-4 digit being to the right of the quaternary point. The significand is not necessarily normalized. 2.5 (decimal) thus is represented as

00 1010

since 2.5 (decimal) = 2.2 (quaternary) = 4^0 × (2 × 4^0 + 2 × 4^-1). 7/8 (decimal) = 3 × 4^-1 + 2 × 4^-2 = 4^-1 × (3 × 4^0 + 2 × 4^-1), so it's represented as

11 1110 The decimal fraction 3/4, which in quaternary is 0.3, has two representations, since it's expressible either as 4^0 × 3/4 or 4^-1 × 3:

11 1100 00 0011

Part a

Find a value representable in representation B and not in representation Q. Defend your answer.

Part b

Fill out the following table.

                    maximum representable   minimum positive        number of distinct
                    value (in decimal)      representable value     representable values
                                            (in decimal)            other than 0
representation B
representation Q

Submission instructions

Combine your solution to this and the following problem into a file named hw6 for submission to Expertiza. This is not a partnership assignment. Hand in your own work. 15.3.3 (Display page) Using floating point (hw6.2) Using floating point

Submit solutions to this and the previous problem to Expertiza under the name hw6 by 11:59pm on March 9. This homework problem will give you a little experience using floating point numbers in C and a little picture of where you need to be careful.

Part a

Write a short C program that takes an integer n from the command line and prints the (x,y) coordinates of n equally spaced points around the unit circle, verifying that x^2+y^2=1 for all these points. For example:

$ ./circle 4 (1.000000,0.000000) r:1.000000 (-0.000000,1.000000) r:1.000000 (-1.000000,-0.000000) r:1.000000 (0.000000,-1.000000) r:1.000000

$ ./circle 5 (1.000000,0.000000) r:1.000000 (0.309017,0.951057) r:1.000000 (-0.809017,0.587785) r:1.000000 (-0.809017,-0.587785) r:1.000000 (0.309017,-0.951056) r:1.000000

You will need to use functions in the math.h library. Doing so requires two steps:

saying #include <math.h> in your source file; giving the option -lm (that's minus L M) at the end of your gcc command. This tells the linker to include the math library, something that doesn't happen by default.

Add your program to the file you'll be submitting to Expertiza.

Part b

In high school you learned that (x+y)*(x-y) = x^2 - y^2. It would be nice if this continued to be true with the finite precision numbers that we actually compute with. Here's a little program that checks this.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    float x = atof(argv[1]);
    float y = atof(argv[2]);
    printf("x: %f\n", x);
    printf("y: %f\n", y);
    printf("x^2 - y^2: \t%f\n", x*x - y*y);
    printf("(x+y)*(x-y): \t%f\n", (x+y)*(x-y));
}
Try feeding it some numbers and see if you can violate the axiom. This will be difficult because the floating point expressions are computed in extended precision. Instead, experiment with this alternative that stores the intermediate values in single-precision float variables.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    float x = atof(argv[1]);
    float y = atof(argv[2]);
    float xx = x*x;
    float yy = y*y;
    float xpy = x+y;
    float xmy = x-y;
    float p1 = xx - yy;
    float p2 = xpy * xmy;
    printf("x: %f\n", x);
    printf("y: %f\n", y);
    printf("x^2 - y^2: \t%f\n", p1);
    printf("(x+y)*(x-y): \t%f\n", p2);
    printf("x^2 - y^2: \t%f\n", x*x - y*y);
    printf("(x+y)*(x-y): \t%f\n", (x+y)*(x-y));
}
Find a pair of values that violate the axiom. Which is more accurate: p1 or p2? In a sentence or two explain why. Add all this information to the file you submit to Expertiza. This is not a partnership assignment. Hand in your own work. 17 IEEE standard representation of floating-point values. A 2009-3-03 ~ 2009-3-04 (3 activities) 17.1 Work with MIPS floating-point values.(20 steps) 17.1.1 (Display page) The IEEE 754 standard for floating-point representation The IEEE 754 standard for floating-point representation

As recently as the 1980's, there was little consistency in the way different computer manufacturers represented floating-point values. IBM used a base of 16, while others used 2 or 8. Different numbers of bits were allocated to exponents and mantissas. Rounding policies varied. To a programmer who did a lot of floating-point computation across more than one computing platform, the situation was a mess. A standards committee, sponsored by IEEE and headed by Professor William Kahan of Berkeley, designed a standard floating-point representation that has been a part of virtually every computer invented since 1980. (The story behind the design is told in Kahan's history of the IEEE 754 standard.) There are two formats in the IEEE standard for floating-point values: single precision (32 bits) and double precision (64 bits). (A "quad" precision representation is currently under consideration.) These correspond to types float and double in C. Each representation contains a sign bit, an exponent, and a mantissa. Both representations are signed magnitude; that is, a number is negated merely by inverting its sign bit. In single precision, 8 bits store the exponent; in double precision, 11 bits. Values are normalized, with the mantissa being at least 1 and less than 2 and the exponent adjusted accordingly. As in scientific notation, a normalized number has one non-zero digit to the left of the binary point. In base two, this non-zero digit is a 1, so we don't need to actually store it. (Later we'll see denormalized numbers, which have a zero there, but for normalized numbers it has to be a 1.) What's left to represent explicitly is a binary fraction spanning 23 bits (single precision) or 52 bits (double precision). This is essentially a fixed-point fraction like you saw in the last brainstorm. These formats appear in diagram form below. Single precision

Double precision

In each case, we are representing the value

(-1)^S × (1 + fraction) × 2^(exponent value) The value 0 is a special case, since it can't be normalized. 0 is represented by a word of all-0 bits. 17.1.2 (Display page) Representation of the exponent Representation of the exponent

An interesting design decision was made regarding the exponent. Not all computers in the 1980's implemented floating-point operations in hardware, so the designers wanted a representation that was usable even in the absence of floating-point hardware. An example is sorting records that include floating-point values; if possible, we would like the integer comparison of two floating-point values to produce the same result as a specialized floating-point comparison does. Consider, however, the values 1/2 = 2^-1 and 2 = 2^1. If we store the exponent as a 2's complement number, these values appear as follows (the 1 in parentheses is the "hidden bit", the implicit top bit of the

mantissa):

1/2 = 0 11111111 (1) 00000000000000000000000
2   = 0 00000001 (1) 00000000000000000000000
In an integer comparison, 1/2 would appear to be bigger than 2. This was undesirable. Instead, the designers chose biased notation, where the most negative exponent is represented as 00 ... 00 and the most positive as 11 ... 11. The amount added to the actual exponent to produce the stored bit pattern is called the bias. The IEEE standard uses a bias of 127 (the number of representable values in 8 bits, divided by 2, minus 1) for single precision and 1023 (computed similarly, for an 11-bit exponent) for double precision. Thus we add 127 to the actual single-precision exponent to get the stored exponent. Summarizing, the value represented by a floating-point number is

(-1)^S × (1 + fraction) × 2^(exponent - bias) 17.1.3 (Display page) Examples Examples

Here are two examples, one converting a decimal value to single-precision floating point, the other converting in the opposite direction. Converting 0 01111111 10101010100001101000010 to decimal.

1. The sign is 0, so the number is nonnegative.
2. The stored exponent is 01111111 base 2 = 127 base 10. The bias adjustment is 127 - 127 = 0.
3. The fraction doesn't need shifting since the exponent is 0.
4. The fraction is 1 (the "hidden" bit) + 1×2^-1 + 0×2^-2 + 1×2^-3 + 0×2^-4 + 1×2^-5 + 0×2^-6 + 1×2^-7 + 0×2^-8 + 1×2^-9 + 0×2^-10 + 0×2^-11 + 0×2^-12 + 0×2^-13 + 1×2^-14 + 1×2^-15 + 0×2^-16 + 1×2^-17 + 0×2^-18 + 0×2^-19 + 0×2^-20 + 0×2^-21 + 1×2^-22 + 0×2^-23 = 1 + 2^-1 + 2^-3 + 2^-5 + 2^-7 + 2^-9 + 2^-14 + 2^-15 + 2^-17 + 2^-22 = 1.0 + 0.666115 = 1.666115.

Converting -2.340625 × 10^1 to binary.

1. Denormalize, getting -23.40625.
2. Convert the integer part: 23 base 10 is 10111 base 2.
3. Convert the fractional part: .40625 base 10 is .25 + .125 + .03125 = .01101 base 2.
4. Put the parts together and normalize: 10111.01101 = 1.011101101 × 2^4.
5. Convert the exponent: 127 + 4 = 131 base 10 = 10000011 base 2.
6. Put it all together, omitting the top bit of the normalized mantissa: 1 10000011 01110110100000000000000

17.1.4 (Brainstorm) Another example Convert

0 1000 0000 001 0000 0000 0000 0000 0000 to decimal. Explain how you did it. 17.1.5 (Brainstorm) Yet another example Convert 134.0625 to binary. Explain how you did it. Convert it to IEEE 754 Single Precision floating point. 17.1.6 (Display page) The Floating-Point Representation tool The Floating-Point Representation tool

MARS provides a "Floating-Point Representation" tool, accessible from the Tools menu. It allows easy experimentation with floating-point values, converting either from binary to decimal or vice versa. Take some time now to try it out on the exercises in the previous two steps. 17.1.7 (Display page) Casting from float to int and back Casting from float to int and back

Uses of type casting that we've seen earlier in this course all have involved reinterpreting a given sequence of bits, say, as a char instead of an int, or as one kind of pointer rather than another. A cast from a float to int, however, converts the float value to an integer, truncating any fractional part. Similarly, a cast from an int to float produces the float value that's closest to the int. Consider now the statement

float f;
...
if (f == (float) ((int) f)) {
    printf ("true");
}

This will fail to print "true" for any float value in f that has a fractional part. 17.1.8 (Brainstorm) Casting from int to float and back Describe the values of k for which

int k;
...
if (k == (int) ((float) k)) {
    printf ("true");
}
will not print "true". 17.1.9 (Display page) Representing 0.3 Representing 0.3

Convert 0.3 to the nearest IEEE single-precision floating-point value. Check your answer using the MARS directive

f: .float 0.3 or by using the MARS Floating-Point Representation tool. 17.1.10 (Brainstorm) Difference between 0.3 and its representation? Let x be the floating-point approximation for 0.3 that you just got from MARS. In mathematical terms, what is 0.3–x? Briefly explain. 17.1.11 (Display page) Float representation from C Float representation from C

Wonder how you might build a floating-point representation tool in C? You have to work pretty hard to peek at the binary representation underlying floats. You can't just cast a float to an int, because the C compiler will insert code to do the conversion. You have to trick it by taking a pointer to the float instead. Here's an example.

#include <stdio.h>
#include <stdlib.h>

void printbits(unsigned int *bits) {
    int i;
    for (i = 31; i >= 0; i--) {
        printf("%d", (*bits >> i) & 1);
    }
}

int main(int argc, char *argv[]) {
    int i = atoi(argv[1]);
    float f = i;
    printbits((unsigned int *) &i);
    printf("\n");
    printbits((unsigned int *) &f);
    printf("\n");
}
This is another tool that you can use. 17.1.13 (Brainstorm) Successor values The value 1,000,000 can be exactly represented in IEEE single-precision floating-point format. What is the next larger floating-point value (not necessarily an integer) that can be represented in IEEE single-precision floating-point format? Give both its value and its bit pattern, and explain how you got it. (MARS and its Floating-Point Representation tool may be helpful here.) After answering this question you will see an answer written by the instructor 17.1.14 (Display page) Addition Addition

Floating-point addition works like addition in scientific notation, as we saw earlier. "Adjust one so that they have a common exponent; add the significands; renormalize." Here, for example, is how 2.25 and 134.0625 are added.

2.25 (decimal) = 0 10000000 (1) 001 0000 0000 0000 0000 0000
134.0625 (decimal) = 0 10000110 (1) 000 0110 0001 0000 0000 0000
We shift the value with the smaller exponent so that the exponents match, thereby denormalizing it. The exponents differ by 6, so

2.25 (decimal) = 0 10000000 (1) 001 0000 0000 0000 0000 0000 becomes

0 10000110 (0) 000 0010 0100 0000 0000 0000 Observe that the "hidden" bit is now 0 because the value is no longer normalized. (The hidden bit is materialized while we're doing operations. It can be hidden again when we're all done.) Now we add:

  0 10000110 (1) 000 0110 0001 0000 0000 0000
+ 0 10000110 (0) 000 0010 0100 0000 0000 0000
= 0 10000110 (1) 000 1000 0101 0000 0000 0000
The answer is already normalized. In general, however, it will need renormalization. Here, for example, is the addition of 134.0625 and 127.0625; the second operand is shown below already shifted to the common exponent.

  0 10000110 (1) 000 0110 0001 0000 0000 0000
+ 0 10000110 (0) 111 1111 0001 0000 0000 0000
= 0 10000110 (10) 000 0101 0010 0000 0000 0000
The sum has overflowed past the hidden bit. Thus the fraction must be shifted right to renormalize it, giving

0 10000111 (1) 000 0010 1001 0000 0000 0000 17.1.15 (Display page) The MIPS floating-point coprocessor


The MIPS floating-point coprocessor

MIPS has special instructions for floating-point operations:

single precision   double precision
add.s              add.d
sub.s              sub.d
mul.s              mul.d
div.s              div.d

As one might guess, these are more complicated than their integer counterparts, requiring extra hardware and usually taking much longer to compute. How should they be implemented relative to the main CPU? Here are some issues:

It's not a good idea to have instructions that take significantly different amounts of time. (We'll talk about this more when we discuss pipelined architecture.)
Many programs do no floating-point computation at all, so a cost-benefit calculation for the extra floating-point hardware needs to be made.
Generally, it is rare for a particular piece of data to be used both as an integer and as a floating-point value. Thus only one type of instruction will be used on typical data.

The solution in the 1980's when MIPS was designed was to have a completely separate chip, a coprocessor, that specializes in floating-point computation. On the MIPS, it's called coprocessor 1. The MIPS architecture allows for four coprocessors; we'll meet coprocessor 0 later in the course, when we explore how input/output operations are handled. Another example of a coprocessor is the "Vector Engine" used in Sony's PlayStation 2. These days, floating-point hardware is probably included on the main CPU chip; alternatively, cheap chips may omit it completely and emulate floating-point operations in software. The MIPS floating-point coprocessor contains 32 32-bit registers named $f0, $f1, etc. The .s and .d instructions listed above refer to these registers. By convention, an even/odd pair ($f0/$f1, $f2/$f3, ...) of registers contains one double-precision value. Single-precision computation is done only in even-numbered registers. There are separate load and store instructions: lwc1 ("Load Word into Coprocessor 1") and swc1 ("Store Word from Coprocessor 1"), which work similarly to lw and sw except that they load into and store out of the specified floating-point registers. The program in ~cs61cl/code/floatadd.s provides some examples of floating-point loads, stores, and addition. 17.1.16 (Display page) Finding an x that you can't add one to Finding an x that you can't add 1 to

Find a positive floating-point value x for which x+1.0=x. Verify your result in a MIPS assembly language program, and determine the stored exponent and fraction for your x value (either on the computer or on paper). The provided MIPS program ~cs61cl/code/floatadd.s helps you experiment with adding floating point values. It leaves the sum in $f12 and also $s0, so you can examine the hex representation of the floating point value by printing $s0. 17.1.17 (Brainstorm) Finding the smallest such x Now find the smallest positive floating-point value x for which x+1.0=x. Again, determine the stored exponent and fraction for x. Explain how you found it, and why you're sure it's the smallest value. After answering this question you will see an answer written by the instructor 17.1.18 (Brainstorm) Floating-point associativity Finally, using what you have learned from the previous two steps, determine three positive floating-point numbers such that adding these numbers in one sequence yields a different value from adding them in another. Briefly explain how you found these numbers. (Hint: Experiment with adding up different amounts of the x value you determined in the previous step, and the value 1.0). This shows that for three floating point numbers a, b, and c, a+b+c does not necessarily equal c+b+a. After answering this question you will see an answer written by the instructor 17.1.19 (Brainstorm) Floating-point multiplication As the old saying goes, 1 and 1 is 10 (in binary). When we add together two n-bit integers the result may be an n+1 bit number. When we multiply two n-bit numbers we have a 2n-bit result. So with multiplication we quickly run out of integer range. Also, you've learned the rule for floating-point addition. It is essentially the rule for addition in scientific notation. State the rule for multiplication in scientific notation. That's probably how floating-point multiplication works.
After answering this question you will see an answer written by the instructor 17.2 Explore the handling of computational errors.(9 steps) 17.2.1 (Display page) Overflow Overflow

Overflow results from producing a value too large in magnitude to be represented in the computer. For example, adding 1 to the largest representable integer produces overflow. Overflow is detected in the hardware, and causes an exception—similar to Java's exception construct—that code may be provided to handle. MIPS also provides a way to ignore overflow, namely, to use unsigned operations. P&H note that unsigned integers are commonly used for memory addresses where overflows are ignored. For that reason, the unsigned arithmetic instructions addu, addiu, and subu do not cause exceptions on overflow, while overflow can be caused by add, addi, and sub. 17.2.2 (Display page) Overflow in C C and overflow

Interestingly, all overflow in integer arithmetic is ignored in C. Verify that C ignores integer overflow in two ways:

1. Add 1 to the largest possible positive signed integer, and print the result. 2. Go back to the step "A look ahead at assembly language" in the activity "Explore the programming environment" on the first day of class for directions on using the mips-gcc command. Using the -O0 (letter O and number zero) option, generate the assembly language equivalent of your program for part 1 and examine the arithmetic instructions it uses.

17.2.3 (Brainstorm) Confusion about overflow A CS 61C (note: not 61CL!) student claims that adding 0xffffffff to itself produces overflow. Suggest how the student might be confused, and provide an explanation to set the student straight. 17.2.4 (Brainstorm) Overflow detection P&H note that MIPS can detect overflow via exception, but unlike many other computers it provides no conditional branch to test overflow. The following code does this. Comment the code. (Overflow is indicated by a carry into the sign bit that produces the wrong sign.)

addu $t0,$t1,$t2
xor  $t3,$t1,$t2
slt  $t3,$t3,$0
bne  $t3,$0,noOverflow
xor  $t3,$t0,$t1
slt  $t3,$t3,$0
bne  $t3,$0,overflow

Floating-point overflow essentially arises from producing an exponent that's too big for the allotted space (8 bits for single-precision values). There is an analogous condition called underflow, which results from an exponent that's too negative to be represented. These two situations are illustrated in

the diagram below. Floating-point overflow and underflow on MIPS raise exceptions, in conformance with the IEEE standard. An interesting problem arises with normalized floating-point numbers, namely, an empty gap around 0. Consider:

The smallest positive representable number, call it x1, is equal to 1.000...0 (binary) × 2^-126 = 2^-126. (Recall that 0 has a zero exponent, so the smallest possible biased exponent for a normalized positive number is 1.) The second smallest representable positive number, call it x2, is

1.000...01 (binary) × 2^-126 = (1 + 0.000...01 (binary)) × 2^-126 = (1 + 2^-23) × 2^-126 = 2^-126 + 2^-149

The distance between x1 and x2 is small: 2^-149. The distance between x1 and 0 is 2^-126, more than 8,000,000 times greater! The diagram below illustrates this problem. Normalization and the implicit 1 are to blame.

The solution in the IEEE standard is to allow denormalized values in the gap. A denormalized value has a 0 exponent but a nonzero mantissa. All denormalized values have the same implicit exponent, -126, but they do not have the leading 1. The smallest denormalized value is 2^-149; the next smallest is 2^-148, closing the gap as shown below.

17.2.6 (Display page) IEEE 754 special cases IEEE 754 special cases

In the days before IEEE 754, computers had a variety of responses to numerical errors. It was common, for instance, to replace an underflow by 0. More bizarre actions are described in William Kahan's history of the IEEE 754 standard. Aside from the variability among platforms, sometimes the actions taken by a given system were mathematically inappropriate. Thus, two principles underlying the IEEE floating-point standard are detection and control. The programmer should be allowed to detect exceptional cases, and then should also have the ability to respond to them if appropriate. To do that, the standard introduced three new constants:

∞ and –∞, the result of dividing by 0; NaN ("Not a Number"), the result of, say, computing the square root of a negative number or computing 0/0.

Infinity is a mathematically useful construct. NaN is defined to "contaminate" a computation; that is, op(NaN, X) = NaN for any mathematical operation, so it is a good tool for debugging computations gone awry. All these constants are represented in the IEEE standard by values with the most positive exponent. For ±∞, the mantissa is all zeroes. For NaN, the mantissa is nonzero. The IEEE standard also requires that these constants be detectable via exceptions: an invalid operation (resulting in NaN) and division by zero. As with any exception, the programmer can choose to handle it on his/her own if

appropriate. The table below gives all the cases for the IEEE representation. exponent mantissa object 0 0 0 0 nonzero denorm 1-254 any ± floating-point value 255 0 ±∞ 255 nonzero NaN 17.2.7 (Display page) Roundoff error Roundoff error

Roundoff error, as we have noticed, is the result of the inability to represent a value exactly because of insufficient computer precision. In particular, we have seen examples where a relatively small value was essentially discarded when adding it to a relatively large number. An obvious strategy for problems that require adding a large number of terms is to add the smaller ones first. The program in ~cs61cl/code/sum.c adds the values 1/k² for k = 1, 2, .... (This sum should converge to π²/6.) Provide code that adds the sum terms in the opposite direction, and see (a) how many terms you can add, and (b) what difference it makes to the computation. 17.2.8 (Display page) Rounding and the IEEE standard Rounding and the IEEE standard

The IEEE standard requires that an exception be available for an inexact result (i.e., the occurrence of roundoff). This exception is usually ignored. The standard also prescribes four kinds of rounding, selectable under program control.

round toward +∞ (round "up"); round toward –∞ (round "down"); round toward 0 (truncate); round to nearest representable value, with a value exactly halfway between two representable values going to the neighbor whose low-order bit is even (round to nearest even).

Rounding to the nearest representable value is the default. This rounding mode tends to balance out inaccuracies. 17.2.9 (Display page) Programmer concerns Programmer concerns

Unfortunately, software developers have not kept pace with the hardware designers. Java, for example, does not provide any way to detect or handle a floating-point exception, or to change the rounding mode. In general,

"Programming languages new (Java) and old (Fortran), and their compilers, still lack competent support for features of IEEE 754 so painstakingly provided by practically all hardware nowadays. S.A.N.E., the Standard Apple Numerical Environment on old MC680x0-based Macs is the main exception. Programmers seem unaware that IEEE 754 is a standard for their programming environment, not just for hardware." (W. Kahan, history of IEEE 754)

Java has other problems as a language for numerical computation; see this article by William Kahan for details. Finally, if you're interested in techniques for minimizing computational error, Math 128A is a good course to take. The prerequisites are Math 53 and 54. Here's the catalog description.

Programming for numerical calculations, round-off error, approximation and interpolation, numerical quadrature, and solution of ordinary differential equations. Practice on the computer.

17.3 Homework(1 step) 17.3.1 (Display page) Upcoming topics Upcoming topics

Activities for lab on Thursday and Friday will involve the process of going from C code to an executable program, in particular, the roles of the assembler and the linker. Relevant readings are P&H sections 2.12 and B.1-B.3 (fourth edition) and sections 2.10 and A.1-A.3 (third edition). 18 IEEE standard representation of floating-point values. 1 2009-3-03 ~ 2009-3-04 (3 activities) 18.1 Work with MIPS floating-point values.(20 steps) 18.1.1 (Display page) The IEEE 754 standard for floating-point representation The IEEE 754 standard for floating-point representation

As recently as the 1980's, there was little consistency in the way different computer manufacturers represented floating-point values. IBM used a base of 16, while others used 2 or 8. Different numbers of bits were allocated to exponents and mantissas. Rounding policies varied. To a programmer who did a lot of floating-point computation across more than one computing platform, the situation was a mess. A standards committee, sponsored by IEEE and headed by Professor William Kahan of Berkeley, designed a standard floating-point representation that has been a part of virtually every computer invented since 1980. (The story behind the design is here.) There are two formats in the IEEE standard for floating-point values: single precision (32 bits) and double precision (64 bits). (A "quad" precision representation is currently under consideration.) These correspond to types float and double in C. Each representation contains a sign bit, an exponent, and a mantissa. Both representations are signed magnitude; that is, a number is negated merely by inverting its sign bit. In single precision, 8 bits store the exponent; in double precision, 11 bits. Values are normalized, with the mantissa being at least 1 and less than 2 and the exponent adjusted accordingly. As in scientific notation, a normalized number has one non-zero digit to the left of the binary point. In base two, this non-zero digit is a 1, so we don't need to actually store it. (Later we'll see denormalized numbers, which have a zero there, but for normalized numbers it has to be a 1.) What's left to represent explicitly is a binary fraction spanning 23 bits (single precision) or 52 bits (double precision). This is essentially a fixed-point fraction like you saw in the last brainstorm. These formats appear in diagram form below. Single precision

Double precision

In each case, we are representing the value

(-1)^S × (1 + fraction) × 2^(exponent value) The value 0 is a special case, since it can't be normalized. 0 is represented by a word of all-0 bits. 18.1.2 (Display page) Representation of the exponent Representation of the exponent

An interesting design decision was made regarding the exponent. Not all computers in the 1980's implemented floating-point operations in hardware, so the designers wanted a representation that was usable even in the absence of floating-point hardware. An example is sorting records that include floating-point values; if possible, we would like the integer comparison of two floating-point values to produce the same result as a specialized floating-point comparison does. Consider, however, the values 1/2 = 2^-1 and 2 = 2^1. If we store the exponent as a 2's complement number, these values appear as follows (the 1 in parentheses is the "hidden bit", the implicit top bit of the mantissa):

1/2 = 0 11111111 (1) 00000000000000000000000 2 = 0 00000001 (1) 00000000000000000000000 In an integer comparison, 1/2 would appear to be bigger than 2. This was undesirable. Instead, the designers chose biased notation, where the most negative exponent is represented as 00 ... 00 and the most positive as 11 ... 11. The amount added to the actual exponent to produce the stored exponent is called the bias. The IEEE standard uses a bias of 127 (the number of representable values in 8 bits, divided by 2, minus 1) for single precision and 1023 (computed similarly, for an 11-bit exponent) for double precision. Thus we add 127 to the actual single-precision exponent to get the stored exponent. Summarizing, the value represented by a floating-point number is

(-1)^S × (1 + fraction) × 2^(exponent - bias) 18.1.3 (Display page) Examples Examples

Here are two examples, one converting a decimal value to single-precision floating point, the other converting in the opposite direction. Converting 0 01111111 10101010100001101000010 to decimal.

1. The sign is 0, so the number is nonnegative. 2. The stored exponent is 01111111 base 2 = 127 base 10. The bias adjustment is 127 - 127 = 0. 3. The fraction doesn't need shifting since the exponent is 0. 4. The fraction is 1 (the "hidden" bit) + 1×2^-1 + 0×2^-2 + 1×2^-3 + 0×2^-4 + 1×2^-5 + 0×2^-6 + 1×2^-7 + 0×2^-8 + 1×2^-9 + 0×2^-10 + 0×2^-11 + 0×2^-12 + 0×2^-13 + 1×2^-14 + 1×2^-15 + 0×2^-16 + 1×2^-17 + 0×2^-18 + 0×2^-19 + 0×2^-20 + 0×2^-21 + 1×2^-22 + 0×2^-23 = 1 + 2^-1 + 2^-3 + 2^-5 + 2^-7 + 2^-9 + 2^-14 + 2^-15 + 2^-17 + 2^-22 = 1.0 + 0.666115 = 1.666115.

Converting -2.340625 × 10^1 to binary.

Denormalize, getting -23.40625. Convert the integer part: 23 base 10 is 10111 base 2. Convert the fractional part: .40625 base 10 is .25 + .125 + .03125 = .01101 base 2. Put the parts together and normalize: 10111.01101 = 1.011101101 × 2^4. Convert the exponent: 127 + 4 = 131 base 10 = 10000011 base 2. Put it all together, omitting the top bit of the normalized mantissa: 1 10000011 01110110100000000000000


18.1.4 (Brainstorm) Another example Convert

0 1000 0000 001 0000 0000 0000 0000 0000 to decimal. Explain how you did it. 18.1.5 (Brainstorm) Yet another example Convert 134.0625 to binary. Explain how you did it. Convert it to IEEE 754 Single Precision floating point. 18.1.6 (Display page) The Floating-Point Representation tool The Floating-Point Representation tool

MARS provides a "Floating-Point Representation" tool, accessible from the Tools menu. It allows easy experimentation with floating-point values, converting either from binary to decimal or vice versa. Take some time now to try it out on the exercises in the previous two steps. 18.1.7 (Display page) Casting from float to int and back Casting from float to int and back

Uses of type casting that we've seen earlier in this course all have involved reinterpreting a given sequence of bits, say, as a char instead of an int, or as one kind of pointer rather than another. A cast from a float to int, however, converts the float value to an integer, truncating any fractional part. Similarly, a cast from an int to float produces the float value that's closest to the int. Consider now the statement

float f; ... if (f == (float) ((int) f)) { printf ("true"); }

This will fail to print "true" for any float value in f that has a fractional part. 18.1.8 (Brainstorm) Casting from int to float and back Describe the values of k for which

int k; ... if (k == (int) ((float) k)) { printf ("true"); } will not print "true". 18.1.9 (Display page) Representing 0.3 Representing 0.3

Convert 0.3 to the nearest IEEE single-precision floating-point value. Check your answer using the MARS directive

f: .float 0.3 or by using the MARS Floating-Point Representation tool. 18.1.10 (Brainstorm) Difference between 0.3 and its representation? Let x be the floating-point approximation for 0.3 that you just got from MARS. In mathematical terms, what is 0.3–x? Briefly explain. 18.1.11 (Display page) Float representation from C Float representation from C

Wonder how you might build a floating point representation tool in C? You have to work pretty hard to peek at the binary representation underlying floats. You can't just cast them to an int, because the C compiler will insert code to do the conversion. You have to trick it by taking a pointer to the float and reinterpreting the pointer. Here's an example.

#include <stdio.h>
#include <stdlib.h>

void printbits(unsigned int *bits) { int i; for (i=31; i>=0; i--) { printf("%d", (*bits >> i) & 1); } }

int main(int argc, char *argv[]) { int i = atoi(argv[1]); float f = i; printbits((unsigned int *)&i); printf("\n"); printbits((unsigned int *)&f); printf("\n"); } This is another tool that you can use. 18.1.13 (Brainstorm) Successor values The value 1,000,000 can be exactly represented in IEEE single-precision floating-point format. What is the next larger floating-point value (not necessarily an integer) that can be represented in IEEE single-precision floating-point format? Give both its value and its bit pattern, and explain how you got it. (MARS and its Floating-Point Representation tool may be helpful here.) After answering this question you will not see other students' answers 18.1.14 (Display page) Addition Addition

Floating-point addition works like addition in scientific notation, as we saw earlier. "Adjust one so that they have a common exponent; add the significands; renormalize." Here, for example, is how 2.25 and 134.0625 are added.

2.25 (base 10) = 0 10000000 (1) 001 0000 0000 0000 0000 0000 (base 2) 134.0625 (base 10) = 0 10000110 (1) 000 0110 0001 0000 0000 0000 (base 2) We shift the value with the smaller exponent so that the exponents match, thereby denormalizing it. The exponents differ by 6, so

2.25 (base 10) = 0 10000000 (1) 001 0000 0000 0000 0000 0000 (base 2) becomes

0 10000110 (0) 000 0010 0100 0000 0000 0000 Observe that the "hidden" bit is now 0 because the value is no longer normalized. (The hidden bit is materialized while we're doing operations. It can be hidden again when we're all done.) Now we add:

0 10000110 (1) 000 0110 0001 0000 0000 0000 + 0 10000110 (0) 000 0010 0100 0000 0000 0000 = 0 10000110 (1) 000 1000 0101 0000 0000 0000 The answer is already normalized. In general, however, it will need renormalization. Here, for example, is the addition of 134.0625 and 127.0625, where the second operand has already been shifted one position (denormalizing it) so that the exponents match.

0 10000110 (1) 000 0110 0001 0000 0000 0000 + 0 10000110 (0) 111 1111 0001 0000 0000 0000 = 0 10000110 (10) 000 0101 0010 0000 0000 0000 The sum has overflowed past the hidden bit. Thus the fraction must be shifted right to renormalize it, giving

0 10000111 (1) 000 0010 1001 0000 0000 0000 18.1.15 (Display page) The MIPS floating-point coprocessor The MIPS floating-point coprocessor

MIPS has special instructions for floating-point operations:

single precision: add.s, sub.s, mul.s, div.s
double precision: add.d, sub.d, mul.d, div.d

As one might guess, these are more complicated than their integer counterparts, requiring extra hardware and usually taking much longer to compute. How should they be implemented relative to the main CPU? Here are some issues:

It's not a good idea to have instructions that take significantly different amounts of time. (We'll talk about this more when we discuss pipelined architecture.) Many programs do no floating-point computation at all, so a cost-benefit calculation for the extra floating-point hardware needs to be made. Generally, it is rare for a particular piece of data to be used both as an integer and as a floating-point value. Thus only one type of instruction will be used on typical data.

The solution in the 1980's, when the MIPS architecture was designed, was to have a completely separate chip—a coprocessor—that specializes in floating-point computation. On the MIPS, it's called coprocessor 1. The MIPS architecture allows for four coprocessors; we'll meet coprocessor 0 later in the course, when we explore how input/output operations are handled. Another example of a coprocessor is the "Vector Engine" used in Sony's PlayStation 2. These days, floating-point hardware is typically included on the main CPU chip; alternatively, cheap chips may omit it completely and emulate floating-point operations in software. The MIPS floating-point coprocessor contains 32 32-bit registers named $f0, $f1, etc. The .s and .d instructions listed above refer to these registers. By convention, an even/odd pair ($f0/$f1, $f2/$f3, ...) of registers contains one double-precision value. Single-precision computation is done only in even-numbered registers. There are separate load and store instructions: lwc1 ("Load Word into Coprocessor 1") and swc1 ("Store Word from Coprocessor 1"), which work similarly to lw and sw except that they load into and store out of the specified floating-point registers. The program in ~cs61cl/code/floatadd.s provides some examples of floating-point loads, stores, and addition. 18.1.16 (Display page) Finding an x that you can't add one to Finding an x that you can't add 1 to

Find a positive floating-point value x for which x+1.0=x. Verify your result in a MIPS assembly language program, and determine the stored exponent and fraction for your x value (either on the computer or on paper). The provided MIPS program ~cs61cl/code/floatadd.s helps you experiment with adding floating point values. It leaves the sum in $f12 and also $s0, so you can examine the hex representation of the floating point value by printing $s0. 18.1.17 (Brainstorm) Finding the smallest such x

Now find the smallest positive floating-point value x for which x+1.0=x. Again, determine the stored exponent and fraction for x. Explain how you found it, and why you're sure it's the smallest value. After answering this question you will not see other students' answers 18.1.18 (Brainstorm) Floating-point associativity Finally, using what you have learned from the previous two steps, determine three positive floating-point numbers such that adding these numbers in one sequence yields a different value from adding them in another. Briefly explain how you found these numbers. (Hint: Experiment with adding up different amounts of the x value you determined in the previous step, and the value 1.0). This shows that for three floating-point numbers a, b, and c, a+b+c does not necessarily equal c+b+a when each sum is evaluated left to right. After answering this question you will not see other students' answers 18.1.19 (Brainstorm) Floating-point multiplication As the old saying goes, 1 and 1 is 10 (in binary). When we add together two n-bit integers the result may be an (n+1)-bit number. When we multiply two n-bit numbers we have a 2n-bit result. So with multiplication we quickly run out of integer range. Also, you've learned the rule for floating-point addition. It is essentially the rule for addition in scientific notation. State the rule for multiplication in scientific notation. That's probably how floating-point multiplication works. After answering this question you will not see other students' answers 18.2 Explore the handling of computational errors.(9 steps) 18.2.1 (Display page) Overflow Overflow

Overflow results from producing a value too large in magnitude to be represented in the computer. For example, adding 1 to the largest representable integer produces overflow. Overflow is detected in the hardware, and causes an exception—similar to Java's exception construct—that code may be provided to handle. MIPS also provides a way to ignore overflow, namely, to use unsigned operations. P&H note that unsigned integers are commonly used for memory addresses where overflows are ignored. For that reason, the unsigned arithmetic instructions addu, addiu, and subu do not cause exceptions on overflow, while overflow can be caused by add, addi, and sub. 18.2.2 (Display page) Overflow in C C and overflow

Interestingly, all overflow in integer arithmetic is ignored in C. Verify that C ignores integer overflow in two ways:

1. Add 1 to the largest possible positive signed integer, and print the result. 2. Go back to the step "A look ahead at assembly language" in the activity "Explore the programming environment" on the first day of class for directions on using the mips-gcc command. Using the -O0 (letter O and number zero) option, generate the assembly language equivalent of your program for part 1 and examine the arithmetic instructions it uses.

18.2.3 (Brainstorm) Confusion about overflow A CS 61C (note: not 61CL!) student claims that adding 0xffffffff to itself produces overflow. Suggest how the student might be confused, and provide an explanation to set the student straight. 18.2.4 (Brainstorm) Overflow detection P&H note that MIPS can detect overflow via exception, but unlike many other computers it provides no conditional branch to test overflow. The following code does this. Comment the code. (Overflow is indicated by a carry into the sign bit that produces the wrong sign.)

addu $t0,$t1,$t2 xor $t3,$t1,$t2 slt $t3,$t3,$0 bne $t3,$0,noOverflow xor $t3,$t0,$t1 slt $t3,$t3,$0 bne $t3,$0,overflow 18.2.5 (Display page) Floating-point overflow and underflow Floating-point overflow and underflow

Floating-point overflow essentially arises from producing an exponent that's too big for the allotted space (8 bits for single-precision values). There is an analogous condition called underflow, resulting from a negative exponent that's too large to be represented. These two situations are illustrated in the diagram below. Floating-point overflow and underflow on MIPS raise exceptions, in conformance with the IEEE standard. An interesting problem arises with normalized floating-point numbers, namely, an empty gap around 0. Consider:

The smallest positive representable number, call it x1, is equal to 1.000 ... 0 (base 2) × 2^-126 = 2^-126. (Recall that 0 has a zero exponent, so the smallest possible biased exponent for a normalized positive number is 1.) The second smallest representable positive number, call it x2, is

1.000 ... 01 (base 2) × 2^-126 = (1 + 0.000 ... 01 (base 2)) × 2^-126 = (1 + 2^-23) × 2^-126 = 2^-126 + 2^-149


The distance between x1 and x2 is small, 2^-149. The distance between x1 and 0 is 2^-126 — more than 8,000,000 times greater! The diagram below illustrates this problem. Normalization and the implicit 1 are to blame.

The solution in the IEEE standard is to allow denormalized values in the gap. A denormalized value has a 0 exponent but a nonzero mantissa. All denormalized values have the same implicit exponent, –126, but they do not have the leading 1. The smallest denormalized value is 2^-149; the next smallest is 2^-148, closing the gap as shown below.

18.2.6 (Display page) IEEE 754 special cases IEEE 754 special cases

In the days before IEEE 754, computers had a variety of responses to numerical errors. It was common, for instance, to replace an underflow by 0. More bizarre actions are described in William Kahan's history of the IEEE 754 standard. Aside from the variability among platforms, sometimes the actions taken by a given system were mathematically inappropriate. Thus, two principles underlying the IEEE floating-point standard are detection and control. The programmer should be allowed to detect exceptional cases, and then should also have the ability to respond to them if appropriate. To do that, the standard introduced three new constants:

∞ and –∞, the result of dividing a nonzero number by 0; NaN ("Not a Number"), the result of, say, computing the square root of a negative number or computing 0/0.

Infinity is a mathematically useful construct. NaN is defined to "contaminate" a computation; that is, op(NaN, X) = NaN for any mathematical operation, so it is a good tool for debugging computations gone awry. All these constants are represented in the IEEE standard by values with the most positive exponent. For ±∞, the mantissa is all zeroes. For NaN, the mantissa is nonzero. The IEEE standard also requires that these constants be detectable via exceptions: an invalid operation (resulting in NaN) and division by zero. As with any exception, the programmer can choose to handle it on his/her own if appropriate. The table below gives all the cases for the IEEE representation.

exponent | mantissa | object
0        | 0        | 0
0        | nonzero  | denormalized value
1-254    | any      | ± floating-point value
255      | 0        | ±∞
255      | nonzero  | NaN

18.2.7 (Display page) Roundoff error Roundoff error

Roundoff error, as we have noticed, is the result of the inability to represent a value exactly because of insufficient computer precision. In particular, we have seen examples where a relatively small value was essentially discarded when adding it to a relatively large number. An obvious strategy for problems that require adding a large number of terms is to add the smaller ones first. The program in ~cs61cl/code/sum.c adds the values 1/k² for k = 1, 2, .... (This sum should converge to π²/6.) Provide code that adds the sum terms in the opposite direction, and see (a) how many terms you can add, and (b) what difference it makes to the computation. 18.2.8 (Display page) Rounding and the IEEE standard Rounding and the IEEE standard

The IEEE standard requires that an exception be available for an inexact result (i.e., the occurrence of roundoff). This exception is usually ignored. The standard also prescribes four kinds of rounding, selectable under program control.

round toward +∞ (round "up"); round toward –∞ (round "down"); round toward 0 (truncate); round to nearest representable value, with a value exactly halfway between two representable values going to the neighbor whose low-order bit is even (round to nearest even).

Rounding to the nearest representable value is the default. This rounding mode tends to balance out inaccuracies. 18.2.9 (Display page) Programmer concerns Programmer concerns

Unfortunately, software developers have not kept pace with the hardware designers. Java, for example, does not provide any way to detect or handle a floating-point exception, or to change the rounding mode. In general,


"Programming languages new (Java) and old (Fortran), and their compilers, still lack competent support for features of IEEE 754 so painstakingly provided by practically all hardware nowadays. S.A.N.E., the Standard Apple Numerical Environment on old MC680x0-based Macs is the main exception. Programmers seem unaware that IEEE 754 is a standard for their programming environment, not just for hardware." (W. Kahan, history of IEEE 754)

Java has other problems as a language for numerical computation; see this article by William Kahan for details. Finally, if you're interested in techniques for minimizing computational error, Math 128A is a good course to take. The prerequisites are Math 53 and 54. Here's the catalog description.

Programming for numerical calculations, round-off error, approximation and interpolation, numerical quadrature, and solution of ordinary differential equations. Practice on the computer.

18.3 Homework(1 step) 18.3.1 (Display page) Upcoming topics Upcoming topics

Activities for lab on Thursday and Friday will involve the process of going from C code to an executable program, in particular, the roles of the assembler and the linker. Relevant readings are P&H sections 2.12 and B.1-B.3 (fourth edition) and sections 2.10 and A.1-A.3 (third edition). 19 IEEE standard representation of floating-point values. AA 2009-3-03 ~ 2009-3-04 (3 activities) 19.1 Work with MIPS floating-point values.(24 steps) 19.1.1 (Display page) The IEEE 754 standard for floating-point representation The IEEE 754 standard for floating-point representation

As recently as the 1980's, there was little consistency in the way different computer manufacturers represented floating-point values. IBM used a base of 16, while others used 2 or 8. Different numbers of bits were allocated to exponents and mantissas. Rounding policies varied. To a programmer who did a lot of floating-point computation across more than one computing platform, the situation was a mess. A standards committee, sponsored by IEEE and headed by Professor William Kahan of Berkeley, designed a standard floating-point representation that has been a part of virtually every computer invented since 1980. (The story behind the design is here.) There are two formats in the IEEE standard for floating-point values: single precision (32 bits) and double precision (64 bits). (A "quad" precision representation is currently under consideration.) These correspond to types float and double in C. Each representation contains a sign bit, an exponent, and a mantissa. Both representations are signed magnitude; that is, a number is negated merely by inverting its sign bit. In single precision, 8 bits store the exponent; in double precision, 11 bits. Values are normalized, with the mantissa being at least 1 and less than 2 and the exponent adjusted accordingly. As in scientific notation, a normalized number has one non-zero digit to the left of the binary point. In base two, this non-zero digit is a 1, so we don't need to actually store it. (Later we'll see denormalized numbers, which have a zero there, but for normalized numbers it has to be a 1.) What's left to represent explicitly is a binary fraction spanning 23 bits (single precision) or 52 bits (double precision). This is essentially a fixed-point fraction like you saw in the last brainstorm. These formats appear in diagram form below. Single precision

Double precision

In each case, we are representing the value

(-1)^S × (1 + fraction) × 2^(exponent value) The value 0 is a special case, since it can't be normalized. 0 is represented by a word of all-0 bits. 19.1.2 (Display page) Representation of the exponent Representation of the exponent

An interesting design decision was made regarding the exponent. Not all computers in the 1980's implemented floating-point operations in hardware, so the designers wanted a representation that was usable even in the absence of floating-point hardware. An example is sorting records that include floating-point values; if possible, we would like the integer comparison of two floating-point values to produce the same result as a specialized floating-point comparison does. Consider, however, the values 1/2 = 2^-1 and 2 = 2^1. If we store the exponent as a 2's complement number, these values appear as follows (the 1 in parentheses is the "hidden bit", the implicit top bit of the mantissa):

1/2 = 0 11111111 (1) 00000000000000000000000 2 = 0 00000001 (1) 00000000000000000000000 In an integer comparison, 1/2 would appear to be bigger than 2. This was undesirable. Instead, the designers chose biased notation, where the most negative exponent is represented as 00 ... 00 and the most positive as 11 ... 11. The amount added to the actual exponent to produce the stored exponent is called the bias. The IEEE standard uses a bias of 127 (the number of representable values in 8 bits, divided by 2, minus 1) for single precision and 1023 (computed similarly, for an 11-bit exponent) for double precision. Thus we add 127 to the actual single-precision exponent to get the stored exponent. Summarizing, the value represented by a floating-point number is

(-1)^S × (1 + fraction) × 2^(exponent - bias) 19.1.3 (Display page) Examples Examples

Here are two examples, one converting a decimal value to single-precision floating point, the other converting in the opposite direction. Converting 0 01111111 10101010100001101000010 to decimal.

1. The sign is 0, so the number is nonnegative. 2. The stored exponent is 01111111 base 2 = 127 base 10. The bias adjustment is 127 - 127 = 0. 3. The fraction doesn't need shifting since the exponent is 0. 4. The fraction is 1 (the "hidden" bit) + 1×2^-1 + 0×2^-2 + 1×2^-3 + 0×2^-4 + 1×2^-5 + 0×2^-6 + 1×2^-7 + 0×2^-8 + 1×2^-9 + 0×2^-10 + 0×2^-11 + 0×2^-12 + 0×2^-13 + 1×2^-14 + 1×2^-15 + 0×2^-16 + 1×2^-17 + 0×2^-18 + 0×2^-19 + 0×2^-20 + 0×2^-21 + 1×2^-22 + 0×2^-23 = 1 + 2^-1 + 2^-3 + 2^-5 + 2^-7 + 2^-9 + 2^-14 + 2^-15 + 2^-17 + 2^-22 = 1.0 + 0.666115 = 1.666115.

Converting -2.340625 × 10^1 to binary.

Denormalize, getting -23.40625. Convert the integer part: 23 base 10 is 10111 base 2. Convert the fractional part: .40625 base 10 is .25 + .125 + .03125 = .01101 base 2. Put the parts together and normalize: 10111.01101 = 1.011101101 × 2^4. Convert the exponent: 127 + 4 = 131 base 10 = 10000011 base 2. Put it all together, omitting the top bit of the normalized mantissa: 1 10000011 01110110100000000000000

19.1.4 (Brainstorm) Another example Convert

0 1000 0000 001 0000 0000 0000 0000 0000 to decimal. Explain how you did it. 19.1.5 (Brainstorm) Yet another example Convert 134.0625 to binary. Explain how you did it. Convert it to IEEE 754 Single Precision floating point. 19.1.6 (Display page) The Floating-Point Representation tool The Floating-Point Representation tool

MARS provides a "Floating-Point Representation" tool, accessible from the Tools menu. It allows easy experimentation with floating-point values, converting either from binary to decimal or vice versa. Take some time now to try it out on the exercises in the previous two steps. 19.1.7 (Display page) Casting from float to int and back Casting from float to int and back

Uses of type casting that we've seen earlier in this course all have involved reinterpreting a given sequence of bits, say, as a char instead of an int, or as one kind of pointer rather than another. A cast from a float to int, however, converts the float value to an integer, truncating any fractional part. Similarly, a cast from an int to float produces the float value that's closest to the int. Consider now the statement

float f; ... if (f == (float) ((int) f)) { printf ("true"); }

This will fail to print "true" for any float value in f that has a fractional part. 19.1.8 (Brainstorm) Casting from int to float and back Describe the values of k for which

int k; ... if (k == (int) ((float) k)) { printf ("true"); } will not print "true". 19.1.9 (Display page) Representing 0.3 Representing 0.3

Convert 0.3 to the nearest IEEE single-precision floating-point value. Check your answer using the MARS directive

f: .float 0.3 or by using the MARS Floating-Point Representation tool. 19.1.10 (Brainstorm) Difference between 0.3 and its representation? Let x be the floating-point approximation for 0.3 that you just got from MARS. In mathematical terms, what is 0.3–x? Briefly explain. 19.1.11 (Display page) Float representation from C Float representation from C

Wonder how you might build a floating-point representation tool in C? You have to work pretty hard to peek at the binary representation underlying floats. You can't just cast them to an int, because the C compiler will insert code to do the conversion. Instead, you trick it: take the float's address and reinterpret the pointer as pointing to an integer. Here's an example.

#include <stdio.h>
#include <stdlib.h>

void printbits(unsigned int *bits) {
    int i;
    for (i = 31; i >= 0; i--) {
        printf("%d", (*bits >> i) & 1);
    }
}

int main(int argc, char *argv[]) {
    int i = atoi(argv[1]);
    float f = i;
    printbits((unsigned int *) &i);
    printf("\n");
    printbits((unsigned int *) &f);
    printf("\n");
}

This is another tool that you can use. 19.1.13 (Display page) Successor Values The value 1,000,000 can be exactly represented in IEEE single-precision floating-point format. What is the next larger floating-point value (not necessarily an integer) that can be represented in IEEE single-precision floating-point format? Give both its value and its bit pattern, and explain how you got it. (MARS and its Floating-Point Representation tool may be helpful here.) You can view an answer to this question in the following step. 19.1.14 (Display page) Successor Values Answer The value 1,000,000 can be exactly represented in IEEE single-precision floating-point format. What is the next larger floating-point value (not necessarily an integer) that can be represented in IEEE single-precision floating-point format? Give both its value and its bit pattern, and explain how you got it. (MARS and its Floating-Point Representation tool may be helpful here.) Answer: 1,000,000 in IEEE single precision floating point is 0 10010010 11101000010010000000000. To get the next larger representable number, we can put a 1 in the least significant bit of the fraction, which gives us 1,000,000.0625 (the low fraction bit is worth 2^(19-23) = 0.0625 at this exponent). 19.1.15 (Display page) Addition Addition

Floating-point addition works like addition in scientific notation, as we saw earlier. "Adjust one so that they have a common exponent; add the significands; renormalize." Here, for example, is how 2.25 and 134.0625 are added.

2.25 base 10 = 0 10000000 (1) 001 0000 0000 0000 0000 0000 base 2
134.0625 base 10 = 0 10000110 (1) 000 0110 0001 0000 0000 0000 base 2
We shift the value with the smaller exponent so that the exponents match, thereby denormalizing it. The exponents differ by 6, so

2.25 base 10 = 0 10000000 (1) 001 0000 0000 0000 0000 0000 base 2 becomes

0 10000110 (0) 000 0010 0100 0000 0000 0000 Observe that the "hidden" bit is now 0 because the value is no longer normalized. (The hidden bit is materialized while we're doing operations. It can be hidden again when we're all done.) Now we add:

0 10000110 (1) 000 0110 0001 0000 0000 0000 + 0 10000110 (0) 000 0010 0100 0000 0000 0000 = 0 10000110 (1) 000 1000 0101 0000 0000 0000 The answer is already normalized. In general, however, it will need renormalization. Here, for example, is the addition of 134.0625 and 127.0625 (the latter is 1.1111110001 base 2 × 2^6, shown below already shifted right one place to match the larger exponent):

0 10000110 (1) 000 0110 0001 0000 0000 0000 + 0 10000110 (0) 111 1111 0001 0000 0000 0000 = 0 10000110 (10) 000 0101 0010 0000 0000 0000 The sum has overflowed past the hidden bit. Thus the fraction must be shifted right to renormalize it, giving

0 10000111 (1) 000 0010 1001 0000 0000 0000 19.1.16 (Display page) The MIPS floating-point coprocessor

The MIPS floating-point coprocessor

MIPS has special instructions for floating-point operations:

single precision: add.s, sub.s, mul.s, div.s
double precision: add.d, sub.d, mul.d, div.d

As one might guess, these are more complicated than their integer counterparts, requiring extra hardware and usually taking much longer to compute. How should they be implemented relative to the main CPU? Here are some issues:

It's not a good idea to have instructions that take significantly different amounts of time. (We'll talk about this more when we discuss pipelined architecture.) Many programs do no floating-point computation at all, so a cost-benefit calculation for the extra floating-point hardware needs to be made. Generally, it is rare for a particular piece of data to be used both as an integer and as a floating-point value. Thus only one type of instruction will be used on typical data.

The solution in the 1990's when MIPS was designed was to have a completely separate chip—a coprocessor—that specializes in floating-point computation. On the MIPS, it's called coprocessor 1. The MIPS architecture allows for four coprocessors; we'll meet coprocessor 0 later in the course, when we explore how input/output operations are handled. Another example of a coprocessor is the "Vector Engine" used in Sony's PlayStation 2. These days, floating-point hardware is probably included on the main CPU chip; alternatively, cheap chips may omit it completely and emulate floating-point operations in software. The MIPS floating-point coprocessor contains 32 32-bit registers named $f0, $f1, etc. The .s and .d instructions listed above refer to these registers. By convention, an even/odd pair ($f0/$f1, $f2/$f3, ...) of registers contains one double-precision value. Single-precision computation is done only in even-numbered registers. There are separate load and store instructions: lwc1 ("Load Word into Coprocessor 1") and swc1 ("Store Word from Coprocessor 1"), which work similarly to lw and sw except that they load into and store out of the specified floating-point registers. The program in ~cs61cl/code/floatadd.s provides some examples of floating-point loads, stores, and addition. 19.1.17 (Display page) Finding an x that you can't add one to Finding an x that you can't add 1 to

Find a positive floating-point value x for which x+1.0=x. Verify your result in a MIPS assembly language program, and determine the stored exponent and fraction for your x value (either on the computer or on paper). The provided MIPS program ~cs61cl/code/floatadd.s helps you experiment with adding floating point values. It leaves the sum in $f12 and also $s0, so you can examine the hex representation of the floating point value by printing $s0. 19.1.18 (Display page) Finding the smallest such x Now find the smallest positive floating-point value x for which x+1.0=x. Again, determine the stored exponent and fraction for x. Explain how you found it, and why you're sure it's the smallest value. You can view an answer to this question in the following step. 19.1.19 (Display page) Finding the smallest such x Answer Now find the smallest positive floating-point value x for which x+1.0=x. Again, determine the stored exponent and fraction for x. Explain how you found it, and why you're sure it's the smallest value. Answer: x = 2^24 is the smallest positive floating-point value for which x + 1.0 = x. Because the fraction part of a single-precision floating-point number is 23 bits long, consecutive floats in [2^23, 2^24) are exactly 1 apart, so for any x in that range x + 1.0 is representable and differs from x. At 2^24 the gap between consecutive floats grows to 2; x + 1.0 then falls exactly halfway between 2^24 and 2^24 + 2 and rounds (to the even neighbor) back to x. 19.1.20 (Display page) Floating point associativity Finally, using what you have learned from the previous two steps, determine three positive floating-point numbers such that adding these numbers in one sequence yields a different value from adding them in another. Briefly explain how you found these numbers. (Hint: Experiment with adding up different amounts of the x value you determined in the previous step, and the value 1.0). This shows that for three floating point numbers a, b, and c, a+b+c does not necessarily equal c+b+a. You can view an answer to this question in the following step.
19.1.21 (Display page) Floating point associativity Answer Finally, using what you have learned from the previous two steps, determine three positive floating-point numbers such that adding these numbers in one sequence yields a different value from adding them in another. Briefly explain how you found these numbers. (Hint: Experiment with adding up different amounts of the x value you determined in the previous step, and the value 1.0). This shows that for three floating point numbers a, b, and c, a+b+c does not necessarily equal c+b+a. Answer: a = 1, b = 1, c = 2^24. Added as (a + b) + c, the 1s combine first: (1 + 1) + 2^24 = 2 + 2^24, which is representable. Added as (c + b) + a, each 1 is absorbed: 2^24 + 1 rounds back to 2^24, and so does the second addition, leaving just 2^24. 19.1.22 (Display page) Floating Point Multiplication As the old saying goes, 1 and 1 is 10 (in binary). When we add together two n-bit integers the result may be an n+1 bit number. When we multiply two n-bit numbers we have a 2n-bit result. So with multiplication we quickly run out of integer range. Also, you've learned the rule for floating-point addition. It is essentially the rule for addition in scientific notation. State the rule for multiplication in scientific notation. That's probably how floating-point multiplication works. You can view an answer to this question in the following step. 19.1.23 (Display page) Floating Point Multiplication Answer As the old saying goes, 1 and 1 is 10 (in binary). When we add together two n-bit integers the result may be an n+1 bit number. When we multiply two n-bit numbers we have a 2n-bit result. So with multiplication we quickly run out of integer range. Also, you've learned the rule for floating-point addition. It is essentially the rule for addition in scientific notation. State the rule for multiplication in scientific notation. That's probably how floating-point multiplication works. Answer: Rule for multiplication in scientific notation: 1. add the exponents 2. multiply the mantissas 3. normalize FP multiplication follows the above rule. Fraction multiplication involves rounding the resultant fraction to 23 bits. Also, the hidden bit has to be included in the operation. The sign bit of the product is the exclusive OR of the operands' sign bits. 19.2 Explore the handling of computational errors.(9 steps) 19.2.1 (Display page) Overflow Overflow

Overflow results from producing a value too large in magnitude to be represented in the computer. For example, adding 1 to the largest representable integer produces overflow. Overflow is detected in the hardware, and causes an exception—similar to Java's exception construct—that code may be provided to handle. MIPS also provides a way to ignore overflow, namely, to use unsigned operations. P&H note that unsigned integers are commonly used for memory addresses where overflows are ignored. For that reason, the unsigned arithmetic instructions addu, addiu, and subu do not cause exceptions on overflow, while overflow can be caused by add, addi, and sub. 19.2.2 (Display page) Overflow in C C and overflow

Interestingly, integer overflow is completely ignored in C: no exception is raised, and the result simply wraps around. (On MIPS, the compiler accomplishes this by emitting the unsigned arithmetic instructions.) Verify that C ignores integer overflow in two ways:

1. Add 1 to the largest possible positive signed integer, and print the result. 2. Go back to the step "A look ahead at assembly language" in the activity "Explore the programming environment" on the first day of class for directions on using the mips-gcc command. Using the -O0 (letter O and number zero) option, generate the assembly language equivalent of your program for part 1 and examine the arithmetic instructions it uses.

19.2.3 (Brainstorm) Confusion about overflow A CS 61C (note: not 61CL!) student claims that adding 0xffffffff to itself produces overflow. Suggest how the student might be confused, and provide an explanation to set the student straight. 19.2.4 (Brainstorm) Overflow detection P&H note that MIPS can detect overflow via exception, but unlike many other computers it provides no conditional branch to test overflow. The following code does this. Comment the code. (Overflow is indicated by a carry into the sign bit that produces the wrong sign.)

addu $t0,$t1,$t2
xor  $t3,$t1,$t2
slt  $t3,$t3,$0
bne  $t3,$0,noOverflow
xor  $t3,$t0,$t1
slt  $t3,$t3,$0
bne  $t3,$0,overflow

19.2.5 (Display page) Floating-point overflow and underflow

Floating-point overflow essentially arises from producing an exponent that's too big for the allotted space (8 bits for single-precision values). There is an analogous condition called underflow, resulting from a negative exponent that's too large to be represented. These two situations are illustrated in

the diagram below. Floating-point overflow and underflow on MIPS raise exceptions, in conformance with the IEEE standard. An interesting problem arises with normalized floating-point numbers, namely, an empty gap around 0. Consider:

The smallest positive representable number, call it x1, is equal to 1.000 ... 0 base 2 × 2^-126 = 2^-126. (Recall that 0 has a zero exponent, so the smallest possible biased exponent for a normalized positive number is 1.) The second smallest representable positive number, call it x2, is

1.000 ... 01 base 2 × 2^-126 = (1 + 0.000 ... 01 base 2) × 2^-126 = (1 + 2^-23) × 2^-126 = 2^-126 + 2^-149

The distance between x1 and x2 is small, 2^-149. The distance between x1 and 0 is 2^-126, more than 8,000,000 times greater! The diagram below illustrates this problem. Normalization and the implicit 1 are to blame.

The solution in the IEEE standard is to allow denormalized values in the gap. A denormalized value has a 0 exponent but a nonzero mantissa. All denormalized values have the same implicit exponent, –126, but they do not have the leading 1. The smallest denormalized value is 2^-149; the next smallest is 2^-148, closing the gap as shown below.

19.2.6 (Display page) IEEE 754 special cases IEEE 754 special cases

In the days before IEEE 754, computers had a variety of responses to numerical errors. It was common, for instance, to replace an underflow by 0. More bizarre actions are described in William Kahan's history of the IEEE 754 standard. Aside from the variability among platforms, sometimes the actions taken by a given system were mathematically inappropriate. Thus, two principles underlying the IEEE floating-point standard are detection and control. The programmer should be allowed to detect exceptional cases, and then should also have the ability to respond to them if appropriate. To do that, the standard introduced three new constants:

∞ and –∞, the result of dividing by 0; NaN ("Not a Number"), the result of, say, computing the square root of a negative number or computing 0/0.

Infinity is a mathematically useful construct. NaN is defined to "contaminate" a computation; that is, op(NaN, X) = NaN for any mathematical operation, so it is a good tool for debugging computations gone awry. All these constants are represented in the IEEE standard by values with the most positive exponent. For ±∞, the mantissa is all zeroes. For NaN, the mantissa is nonzero. The IEEE standard also requires that these constants be detectable via exceptions: an invalid operation (resulting in NaN) and division by zero. As with any exception, the programmer can choose to handle it on his/her own if appropriate. The table below gives all the cases for the IEEE representation.

exponent   mantissa   object
0          0          0
0          nonzero    denormalized value
1-254      any        ± floating-point value
255        0          ±∞
255        nonzero    NaN

19.2.7 (Display page) Roundoff error Roundoff error

Roundoff error, as we have noticed, is the result of the inability to represent a value exactly because of insufficient computer precision. In particular, we have seen examples where a relatively small value was essentially discarded when adding it to a relatively large number. An obvious strategy for problems that require adding a large number of terms is to add the smaller ones first. The program in ~cs61cl/code/sum.c adds the values 1/k^2 for k = 1, 2, .... (This sum should converge to π^2/6.) Provide code that adds the sum terms in the opposite direction, and see (a) how many terms you can add, and (b) what difference it makes to the computation. 19.2.8 (Display page) Rounding and the IEEE standard Rounding and the IEEE standard

The IEEE standard requires that an exception be available for inexact result (i.e. the occurrence of roundoff). This exception is usually ignored. The standard also prescribes four kinds of rounding, selectable under program control.

round toward +∞ (round "up"); round toward –∞ (round "down"); round toward 0 (truncate); round to the nearest representable value, and to the value whose low-order fraction bit is 0 (the "even" neighbor) if the approximation is exactly halfway between two representable values.

Rounding to the nearest representable value is the default. This rounding mode tends to balance out inaccuracies. 19.2.9 (Display page) Programmer concerns Programmer concerns

Unfortunately, software developers have not kept pace with the hardware designers. Java, for example, does not provide any way to detect or handle a floating-point exception, or to change the rounding mode. In general,

"Programming languages new (Java) and old (Fortran), and their compilers, still lack competent support for features of IEEE 754 so painstakingly provided by practically all hardware nowadays. S.A.N.E., the Standard Apple Numerical Environment on old MC680x0-based Macs is the main exception. Programmers seem unaware that IEEE 754 is a standard for their programming environment, not just for hardware." (W. Kahan, history of IEEE 754)

Java has other problems as a language for numerical computation; see this article by William Kahan for details. Finally, if you're interested in techniques for minimizing computational error, Math 128A is a good course to take. The prerequisites are Math 53 and 54. Here's the catalog description.

Programming for numerical calculations, round-off error, approximation and interpolation, numerical quadrature, and solution of ordinary differential equations. Practice on the computer.

19.3 Homework(1 step) 19.3.1 (Display page) Upcoming topics Upcoming topics

Activities for lab on Thursday and Friday will involve the process of going from C code to an executable program, in particular, the roles of the assembler and the linker. Relevant readings are P&H sections 2.12 and B.1-B.3 (fourth edition) and sections 2.10 and A.1-A.3 (third edition). 20 IEEE standard representation of floating-point values 2009-3-03 ~ 2009-3-04 (3 activities) 20.1 Work with MIPS floating-point values.(20 steps) 20.1.1 (Display page) The IEEE 754 standard for floating-point representation The IEEE 754 standard for floating-point representation

As recently as the 1980's, there was little consistency in the way different computer manufacturers represented floating-point values. IBM used a base of 16, while others used 2 or 8. Different numbers of bits were allocated to exponents and mantissas. Rounding policies varied. To a programmer who did a lot of floating-point computation across more than one computing platform, the situation was a mess. A standards committee, sponsored by IEEE and headed by Professor William Kahan of Berkeley, designed a standard floating-point representation that has been a part of virtually every computer invented since 1980. (The story behind the design is here.) There are two formats in the IEEE standard for floating-point values: single precision (32 bits) and double precision (64 bits). (A "quad" precision representation is currently under consideration.) These correspond to types float and double in C. Each representation contains a sign bit, an exponent, and a mantissa. Both representations are signed magnitude; that is, a number is negated merely by inverting its sign bit. In single precision, 8 bits store the exponent; in double precision, 11 bits. Values are normalized, with the mantissa being at least 1 and less than 2 and the exponent adjusted accordingly. As in scientific notation, a normalized number has one nonzero digit to the left of the binary point. In base two, this nonzero digit is a 1, so we don't need to actually store it. (Later we'll see denormalized numbers, which have a zero there, but for normalized numbers it has to be a 1.) What's left to represent explicitly is a binary fraction spanning 23 bits (single precision) or 52 bits (double precision). This is essentially a fixed-point fraction like you saw in the last brainstorm. These formats appear in diagram form below. Single precision

Double precision

In each case, we are representing the value

(-1)^S × (1 + fraction) × 2^exponent The value 0 is a special case, since it can't be normalized. 0 is represented by a word of all-0 bits. 20.1.2 (Display page) Representation of the exponent Representation of the exponent

An interesting design decision was made regarding the exponent. Not all computers in the 1980's implemented floating-point operations in hardware, so the designers wanted a representation that was usable even in the absence of floating-point hardware. An example is sorting records that include floating-point values; if possible, we would like the integer comparison of two floating-point values to produce the same result as a specialized floating-point comparison does. Consider, however, the values 1/2 = 2^-1 and 2 = 2^1. If we store the exponent as a 2's complement number, these values appear as follows (the 1 in parentheses is the "hidden bit", the implicit top bit of the mantissa):

1/2 = 0 11111111 (1) 00000000000000000000000 2 = 0 00000001 (1) 00000000000000000000000 In an integer comparison, 1/2 would appear to be bigger than 2. This was undesirable. Instead, the designers chose biased notation, where the most negative exponent is represented as 00 ... 00 and the most positive as 11 ... 11. The amount added to the actual exponent to obtain the stored exponent is called the bias. The IEEE standard uses a bias of 127 (the number of representable values in 8 bits, divided by 2, minus 1) for single precision and 1023 (computed similarly, for an 11-bit exponent) for double precision. Thus we add 127 to the actual single-precision exponent to get the stored exponent. Summarizing, the value represented by a floating-point number is

(-1)^S × (1 + fraction) × 2^(exponent - bias) 20.1.3 (Display page) Examples Examples

Here are two examples, one converting a decimal value to single-precision floating point, the other converting in the opposite direction. Converting 0 01111111 10101010100001101000010 to decimal.


1. The sign is 0, so the number is nonnegative. 2. The stored exponent is 01111111 base 2 = 127 base 10. The bias adjustment is 127 - 127 = 0. 3. The fraction doesn't need shifting since the exponent is 0. 4. The fraction is 1 (the "hidden" bit) + 1×2^-1 + 0×2^-2 + 1×2^-3 + 0×2^-4 + 1×2^-5 + 0×2^-6 + 1×2^-7 + 0×2^-8 + 1×2^-9 + 0×2^-10 + 0×2^-11 + 0×2^-12 + 0×2^-13 + 1×2^-14 + 1×2^-15 + 0×2^-16 + 1×2^-17 + 0×2^-18 + 0×2^-19 + 0×2^-20 + 0×2^-21 + 1×2^-22 + 0×2^-23 = 1 + 2^-1 + 2^-3 + 2^-5 + 2^-7 + 2^-9 + 2^-14 + 2^-15 + 2^-17 + 2^-22 = 1.0 + 0.666115 = 1.666115.

Converting -2.340625 × 10^1 to binary.

Denormalize, getting -23.40625. Convert the integer part: 23 base 10 is 10111 base 2. Convert the fractional part: .40625 base 10 is .25 + .125 + .03125 = .01101 base 2. Put the parts together and normalize: 10111.01101 = 1.011101101 × 2^4. Convert the exponent: 127 + 4 = 131 base 10 = 10000011 base 2. Put it all together, omitting the top bit of the normalized mantissa: 1 10000011 01110110100000000000000

20.1.4 (Brainstorm) Another example Convert

0 1000 0000 001 0000 0000 0000 0000 0000 to decimal. Explain how you did it. 20.1.5 (Brainstorm) Yet another example Convert 134.0625 to binary. Explain how you did it. Convert it to IEEE 754 Single Precision floating point. 20.1.6 (Display page) The Floating-Point Representation tool The Floating-Point Representation tool

MARS provides a "Floating-Point Representation" tool, accessible from the Tools menu. It allows easy experimentation with floating-point values, converting either from binary to decimal or vice versa. Take some time now to try it out on the exercises in the previous two steps. 20.1.7 (Display page) Casting from float to int and back Casting from float to int and back

Uses of type casting that we've seen earlier in this course all have involved reinterpreting a given sequence of bits, say, as a char instead of an int, or as one kind of pointer rather than another. A cast from a float to int, however, converts the float value to an integer, truncating any fractional part. Similarly, a cast from an int to float produces the float value that's closest to the int. Consider now the statement

float f; ... if (f == (float) ((int) f)) { printf ("true"); }

This will fail to print "true" for any float value in f that has a fractional part. 20.1.8 (Brainstorm) Casting from int to float and back Describe the values of k for which

int k; ... if (k == (int) ((float) k)) { printf ("true"); } will not print "true". 20.1.9 (Display page) Representing 0.3 Representing 0.3

Convert 0.3 to the nearest IEEE single-precision floating-point value. Check your answer using the MARS directive

f: .float 0.3 or by using the MARS Floating-Point Representation tool. 20.1.10 (Brainstorm) Difference between 0.3 and its representation? Let x be the floating-point approximation for 0.3 that you just got from MARS. In mathematical terms, what is 0.3–x? Briefly explain. 20.1.11 (Display page) Float representation from C Float representation from C


Wonder how you might build a floating-point representation tool in C? You have to work pretty hard to peek at the binary representation underlying floats. You can't just cast them to an int, because the C compiler will insert code to do the conversion. Instead, you trick it: take the float's address and reinterpret the pointer as pointing to an integer. Here's an example.

#include <stdio.h>
#include <stdlib.h>

void printbits(unsigned int *bits) {
    int i;
    for (i = 31; i >= 0; i--) {
        printf("%d", (*bits >> i) & 1);
    }
}

int main(int argc, char *argv[]) {
    int i = atoi(argv[1]);
    float f = i;
    printbits((unsigned int *) &i);
    printf("\n");
    printbits((unsigned int *) &f);
    printf("\n");
}

This is another tool that you can use. 20.1.13 (Brainstorm) Successor values The value 1,000,000 can be exactly represented in IEEE single-precision floating-point format. What is the next larger floating-point value (not necessarily an integer) that can be represented in IEEE single-precision floating-point format? Give both its value and its bit pattern, and explain how you got it. (MARS and its Floating-Point Representation tool may be helpful here.) 20.1.14 (Display page) Addition Addition

Floating-point addition works like addition in scientific notation, as we saw earlier. "Adjust one so that they have a common exponent; add the significands; renormalize." Here, for example, is how 2.25 and 134.0625 are added.

2.25 base 10 = 0 10000000 (1) 001 0000 0000 0000 0000 0000 base 2
134.0625 base 10 = 0 10000110 (1) 000 0110 0001 0000 0000 0000 base 2
We shift the value with the smaller exponent so that the exponents match, thereby denormalizing it. The exponents differ by 6, so

2.25 base 10 = 0 10000000 (1) 001 0000 0000 0000 0000 0000 base 2 becomes

0 10000110 (0) 000 0010 0100 0000 0000 0000 Observe that the "hidden" bit is now 0 because the value is no longer normalized. (The hidden bit is materialized while we're doing operations. It can be hidden again when we're all done.) Now we add:

0 10000110 (1) 000 0110 0001 0000 0000 0000 + 0 10000110 (0) 000 0010 0100 0000 0000 0000 = 0 10000110 (1) 000 1000 0101 0000 0000 0000 The answer is already normalized. In general, however, it will need renormalization. Here, for example, is the addition of 134.0625 and 127.0625 (the latter is 1.1111110001 base 2 × 2^6, shown below already shifted right one place to match the larger exponent):

0 10000110 (1) 000 0110 0001 0000 0000 0000 + 0 10000110 (0) 111 1111 0001 0000 0000 0000 = 0 10000110 (10) 000 0101 0010 0000 0000 0000 The sum has overflowed past the hidden bit. Thus the fraction must be shifted right to renormalize it, giving

0 10000111 (1) 000 0010 1001 0000 0000 0000 20.1.15 (Display page) The MIPS floating-point coprocessor The MIPS floating-point coprocessor

MIPS has special instructions for floating-point operations:

single precision: add.s, sub.s, mul.s, div.s
double precision: add.d, sub.d, mul.d, div.d

As one might guess, these are more complicated than their integer counterparts, requiring extra hardware and usually taking much longer to compute. How should they be implemented relative to the main CPU? Here are some issues:

It's not a good idea to have instructions that take significantly different amounts of time. (We'll talk about this more when we discuss pipelined architecture.) Many programs do no floating-point computation at all, so a cost-benefit calculation for the extra floating-point hardware needs to be made. Generally, it is rare for a particular piece of data to be used both as an integer and as a floating-point value. Thus only one type of instruction will be used on typical data.

The solution in the 1990's when MIPS was designed was to have a completely separate chip—a coprocessor—that specializes in floating-point computation. On the MIPS, it's called coprocessor 1. The MIPS architecture allows for four coprocessors; we'll meet coprocessor 0 later in the course, when we explore how input/output operations are handled. Another example of a coprocessor is the "Vector Engine" used in Sony's PlayStation 2. These days, floating-point hardware is probably included on the main CPU chip; alternatively, cheap chips may omit it completely and emulate floating-point operations in software. The MIPS floating-point coprocessor contains 32 32-bit registers named $f0, $f1, etc. The .s and .d instructions listed above refer to these registers. By convention, an even/odd pair ($f0/$f1, $f2/$f3, ...) of registers contains one double-precision value. Single-precision computation is done only in even-numbered registers. There are separate load and store instructions: lwc1 ("Load Word into Coprocessor 1") and swc1 ("Store Word from Coprocessor 1"), which work similarly to lw and sw except that they load into and store out of the specified floating-point registers. The program in ~cs61cl/code/floatadd.s provides some examples of floating-point loads, stores, and addition. 20.1.16 (Display page) Finding an x that you can't add one to Finding an x that you can't add 1 to

Find a positive floating-point value x for which x+1.0=x. Verify your result in a MIPS assembly language program, and determine the stored exponent and fraction for your x value (either on the computer or on paper). The provided MIPS program ~cs61cl/code/floatadd.s helps you experiment with adding floating point values. It leaves the sum in $f12 and also $s0, so you can examine the hex representation of the floating point value by printing $s0. 20.1.17 (Brainstorm) Finding the smallest such x Now find the smallest positive floating-point value x for which x+1.0=x. Again, determine the stored exponent and fraction for x. Explain how you found it, and why you're sure it's the smallest value. 20.1.18 (Brainstorm) Floating-point associativity Finally, using what you have learned from the previous two steps, determine three positive floating-point numbers such that adding these numbers in one sequence yields a different value from adding them in another. Briefly explain how you found these numbers. (Hint: Experiment with adding up different amounts of the x value you determined in the previous step, and the value 1.0). This shows that for three floating point numbers a, b, and c, a+b+c does not necessarily equal c+b+a. 20.1.19 (Brainstorm) Floating-point multiplication As the old saying goes, 1 and 1 is 10 (in binary). When we add together two n-bit integers the result may be an n+1 bit number. When we multiply two n-bit numbers we have a 2n-bit result. So with multiplication we quickly run out of integer range. Also, you've learned the rule for floating-point addition. It is essentially the rule for addition in scientific notation. State the rule for multiplication in scientific notation. That's probably how floating-point multiplication works. 20.2 Explore the handling of computational errors.(9 steps) 20.2.1 (Display page) Overflow Overflow

Overflow results from producing a value too large in magnitude to be represented in the computer. For example, adding 1 to the largest representable integer produces overflow. Overflow is detected in the hardware, and causes an exception—similar to Java's exception construct—for which handler code may be provided. MIPS also provides a way to ignore overflow, namely, to use unsigned operations. P&H note that unsigned integers are commonly used for memory addresses, where overflows are ignored. For that reason, the unsigned arithmetic instructions addu, addiu, and subu never cause exceptions on overflow, while add, addi, and sub do. 20.2.2 (Display page) Overflow in C C and overflow

Interestingly, all overflow in integer arithmetic is ignored in C. Verify that C ignores integer overflow in two ways:

1. Add 1 to the largest possible positive signed integer, and print the result.
2. Go back to the step "A look ahead at assembly language" in the activity "Explore the programming environment" on the first day of class for directions on using the mips-gcc command. Using the -O0 (letter O and number zero) option, generate the assembly language equivalent of your program for part 1 and examine the arithmetic instructions it uses.

20.2.3 (Brainstorm) Confusion about overflow A CS 61C (note: not 61CL!) student claims that adding 0xffffffff to itself produces overflow. Suggest how the student might be confused, and provide an explanation to set the student straight. 20.2.4 (Brainstorm) Overflow detection P&H note that MIPS can detect overflow via exception, but unlike many other computers it provides no conditional branch to test overflow. The following code performs such a test. Comment the code. (Overflow is indicated by a carry into the sign bit that produces the wrong sign.)

addu $t0,$t1,$t2
xor  $t3,$t1,$t2
slt  $t3,$t3,$0
bne  $t3,$0,noOverflow
xor  $t3,$t0,$t1
slt  $t3,$t3,$0
bne  $t3,$0,overflow

20.2.5 (Display page) Floating-point overflow and underflow

Floating-point overflow and underflow

Floating-point overflow essentially arises from producing an exponent that's too big for the allotted space (8 bits for single-precision values). There is an analogous condition called underflow, which results from an exponent that's too negative to be represented. These two situations are illustrated in the diagram below. Floating-point overflow and underflow on MIPS raise exceptions, in conformance with the IEEE standard. An interesting problem arises with normalized floating-point numbers, namely, an empty gap around 0. Consider:

The smallest positive representable number, call it x1, is equal to 1.000...0_2 × 2^-126 = 2^-126. (Recall that 0 has a zero exponent, so the smallest possible biased exponent for a normalized positive number is 1.) The second smallest representable positive number, call it x2, is

1.000...01_2 × 2^-126 = (1 + 0.000...01_2) × 2^-126 = (1 + 2^-23) × 2^-126 = 2^-126 + 2^-149

The distance between x1 and x2 is small, 2^-149. The distance between x1 and 0 is 2^-126, more than 8,000,000 times greater! The diagram below illustrates this problem. Normalization and the implicit 1 are to blame.

The solution in the IEEE standard is to allow denormalized values in the gap. A denormalized value has a 0 exponent but a nonzero mantissa. All denormalized values have the same implicit exponent, -126, but they do not have the leading 1. The smallest denormalized value is 2^-149; the next smallest is 2^-148, closing the gap as shown below.

20.2.6 (Display page) IEEE 754 special cases IEEE 754 special cases

In the days before IEEE 754, computers had a variety of responses to numerical errors. It was common, for instance, to replace an underflow by 0. More bizarre actions are described in William Kahan's history of the IEEE 754 standard. Aside from the variability among platforms, sometimes the actions taken by a given system were mathematically inappropriate. Thus, two principles underlying the IEEE floating-point standard are detection and control. The programmer should be allowed to detect exceptional cases, and then should also have the ability to respond to them if appropriate. To do that, the standard introduced three new constants:

∞ and –∞, the result of dividing by 0;
NaN ("Not a Number"), the result of, say, computing the square root of a negative number or computing 0/0.

Infinity is a mathematically useful construct. NaNs are defined to "contaminate" a computation; that is, op(NaN, X) = NaN for any mathematical operation, so NaN is a good tool for debugging computations gone awry. All these constants are represented in the IEEE standard by values with the most positive exponent. For ±∞, the mantissa is all zeroes. For NaN, the mantissa is nonzero. The IEEE standard also requires that these constants be detectable via exceptions: an invalid operation (resulting in NaN) and division by zero. As with any exception, the programmer can choose to handle it if appropriate. The table below gives all the cases for the IEEE representation.

exponent  mantissa  object
0         0         0
0         nonzero   denorm
1-254     any       ± floating-point value
255       0         ±∞
255       nonzero   NaN

20.2.7 (Display page) Roundoff error Roundoff error

Roundoff error, as we have noticed, is the result of the inability to represent a value exactly because of insufficient computer precision. In particular, we have seen examples where a relatively small value was essentially discarded when adding it to a relatively large number. An obvious strategy for problems that require adding a large number of terms is to add the smaller ones first. The program in ~cs61cl/code/sum.c adds the values 1/k^2 for k = 1, 2, .... (This sum should converge to π^2/6.) Provide code that adds the sum terms in the opposite direction, and see (a) how many terms you can add, and (b) what difference it makes to the computation.


20.2.8 (Display page) Rounding and the IEEE standard Rounding and the IEEE standard

The IEEE standard requires that an exception be available for an inexact result (i.e., the occurrence of roundoff). This exception is usually ignored. The standard also prescribes four kinds of rounding, selectable under program control.

round toward +∞ (round "up");
round toward –∞ (round "down");
round toward 0 (truncate);
round to the nearest representable value, choosing the neighbor whose low-order fraction bit is 0 (the "even" neighbor) if the result is exactly halfway between two representable values.

Rounding to the nearest representable value is the default. This rounding mode tends to balance out inaccuracies. 20.2.9 (Display page) Programmer concerns Programmer concerns

Unfortunately, software developers have not kept pace with the hardware designers. Java, for example, does not provide any way to detect or handle a floating-point exception, or to change the rounding mode. In general,

"Programming languages new (Java) and old (Fortran), and their compilers, still lack competent support for features of IEEE 754 so painstakingly provided by practically all hardware nowadays. S.A.N.E., the Standard Apple Numerical Environment on old MC680x0-based Macs is the main exception. Programmers seem unaware that IEEE 754 is a standard for their programming environment, not just for hardware." (W. Kahan, history of IEEE 754)

Java has other problems as a language for numerical computation; see this article by William Kahan for details. Finally, if you're interested in techniques for minimizing computational error, Math 128A is a good course to take. The prerequisites are Math 53 and 54. Here's the catalog description.

Programming for numerical calculations, round-off error, approximation and interpolation, numerical quadrature, and solution of ordinary differential equations. Practice on the computer.

20.3 Homework(1 step) 20.3.1 (Display page) Upcoming topics Upcoming topics

Activities for lab on Thursday and Friday will involve the process of going from C code to an executable program, in particular, the roles of the assembler and the linker. Relevant readings are P&H sections 2.12 and B.1-B.3 (fourth edition) and sections 2.10 and A.1-A.3 (third edition). 21 Interpreters and linkers 2009-3-05 ~ 2009-3-06 (5 activities) 21.1 Practice with the CAL16 machine language(11 steps) 21.1.1 (Display page) Translators and interpreters Translators and interpreters

Thus far this semester you've worked with a C compiler (gcc), which translates C to assembly language, and a MIPS assembler and interpreter (MARS), which translates a MIPS assembly language program to MIPS machine language and then executes the translated program. Finally, you just finished writing an assembler for the CAL16 assembly language. Let's try to put all these into a common framework. In general, if you have a program written in language X, you have two options for getting it to run.

One way is to execute it directly, using an interpreter. (Simulator is another word for interpreter.) The computer itself is the interpreter for its own machine language; when you execute a program, you are really running this interpreter. You also used an interpreter in CS 61A to execute Scheme programs. All interpreters have the same form: repeatedly get a statement or construct of the language, and execute it. The other option is to translate the program into something that's easier to execute. Examples are gcc, which translates a C program first into assembly language and then into machine language for a target computer, and javac, which translates a Java program into the machine language for the Java virtual machine (which is then interpreted).

In the next few activities, we'll be examining and using an interpreter for the CAL16 machine language to run the programs output by your CAL16 assembler. We'll also discuss the linker, a key component in constructing executable files. 21.1.2 (Display page) A CAL16 interpreter The CAL16 interpreter

The files in ~cs61cl/code/CAL16.sim comprise an interpreter for the CAL16 machine language. The file CAL16.mem.c contains the code for accessing memory; the file CAL16.sim.c implements the fetch-execute loop. Copy the directory, then type "make cal16" to produce a runnable version of this interpreter. cal16 takes one command-line argument, the name of a "machine language" program to be executed. (Your assembler outputs the program in a .o file.) After executing an instruction, cal16 prints the updated register and memory values, then waits for you to type the return key. It then executes the next instruction and prints registers and memory again. To stop execution, type 'q' followed by a return rather than just a return. Either by hand or using your assembler, write a two-statement program that copies a nonzero value into register $5, then doubles it. Run cal16 on that program and see what happens. 21.1.3 (Brainstorm) Execution behavior Did cal16 stop after executing the second statement? Why or why not? 21.1.4 (Display page) cal16 in more detail cal16 in more detail

The next few steps involve questions about the cal16 code. You may also wish to review the project 1 specification, especially the instruction summary at the end. 21.1.5 (Brainstorm) 0xffff? Why do most of the instructions involve "& 0xffff"? 21.1.6 (Brainstorm) bneg coding? Why not, in the handling of the bneg instruction, merely say "if (r[destIndex] < 0)"? 21.1.7 (Brainstorm) Effect of misalignment? The MIPS computer generates an error if an attempt is made to load a word from an unaligned location. Does cal16 do this? Explain why or why not. 21.1.8 (Display page) Simulating a MIPS la Simulating a MIPS la

Suppose that value labels a .data instruction. Give the CAL16 instructions that perform the CAL16 equivalent of the MIPS assembly language instruction

la $7,value

After assembling your code (either by hand or using your assembler), test it with cal16. 21.1.9 (Brainstorm) jr Describe in your own words how the jr instruction works. In particular, how do you use it to call a subroutine, and how to return? 21.1.10 (Brainstorm) Halting with jr? Give a jr instruction that halts the program currently being executed, and explain your answer. 21.1.11 (Display page) A subroutine call A subroutine call

Complete and test the program below. It should load the word labeled by value into register $2 (using $3 to hold its address), jump to update (with $15 containing the address of the jump), receive an updated value in $1, and store it back into the word labeled by value.

jmp main;            # always start at main

update:              # the update subroutine
  bneg $2 decrement; # return the value that's
increment:           # 1 further from 0
  addi $1 $2 1;
  jr $0 2($15);
decrement:
  addi $1 $2 -1;
  jr $0 2($15);

main:
  # Load what's labeled by value into $2.
  # (You fill this in.)

  # Load address labeled by update into $14,
  # and then call update.
  # (You fill this in.)

  st $1 0($3);       # store updated value back into value

  # Halt. (You fill this in.)

value: .data 3;

21.2 Linkage editing(8 steps) 21.2.1 (Display page) Rearranging code Rearranging code

Interchange the update subroutine and the main program, as shown below. Verify that it still works.

main:
  # Load what's labeled by value into $2.
  # (You fill this in.)

  # Load address labeled by update into $14,
  # and then call update.
  # (You fill this in.)

  st $1 0($3);       # store updated value back into value

  # Halt. (You fill this in.)

value: .data 3

jmp main;            # always start at main

update:              # the update subroutine
  bneg $2 decrement; # return the value that's
increment:           # 1 further from 0
  addi $1 $2 1;
  jr $0 2($15);
decrement:
  addi $1 $2 -1;
  jr $0 2($15);

21.2.2 (Display page) Assembling modules separately Assembling modules separately

Now separate the update function from the main program as shown below. Run each program segment separately through your assembler, concatenate the .o files, and run the resulting program through cal16. Keep track of what happens.

The main module:

main:
  # Load what's labeled by value
  # into $2.
  # (You fill this in.)

  # Load address labeled by update
  # into $14, and then call update.
  # (You fill this in.)

  st $1 0($3);
  # Halt. (You fill this in.)
value: .data 3;

The update module:

jmp main;
update:
  bneg $2 decrement;
increment:
  addi $1 $2 1;
  jr $0 2($15);
decrement:
  addi $1 $2 -1;
  jr $0 2($15);

The situation we have just noticed is the following. We have two separate modules that are being translated separately. (This situation is typical when the modules are large, or part of a much bigger program.) Neither module can yet know addresses of functions or data that it needs to access in the other module. Moreover, the two modules need to be laid out sequentially in memory. Addresses in the second module are computed by the assembler assuming that the module starts at address 0, which is not the case. A program called the linker solves these problems. A linker for CAL16 programs takes a collection of .o files and the corresponding symbol tables, and produces a single executable file using the following steps.

1. The assembler has assumed that the first instruction in each .o file starts at address 0. The linker takes the .o files and concatenates them, keeping track of the starting address of each individual segment.
2. Recall now that each entry in the symbol table contains a symbol, its value, and a collection of type/reference pairs for that symbol. The linker now finds the actual address of each symbol, and updates the corresponding value in the combined symbol table.
3. Finally, the linker goes through the references to each symbol. Each reference is associated with a jmp, lhi, or llo instruction. For each reference, the linker replaces the corresponding instruction by an instruction in which the address part has been updated.

The instructions that the linker replaces were highlighted in boldface in the original; below they are marked with a trailing "# patched" comment instead.

jmp main;              # patched
update:
  bneg $2 decrement;
increment:
  addi $1 $2 1;
  jr $0 2($15);
decrement:
  addi $1 $2 -1;
  jr $0 2($15);
main:
  lhi $3 value;        # patched
  rotr $3 $3 8;
  llo $7 value;        # patched
  or $3 $3 $7;
  ld $2 0($3);
  lhi $14 update;      # patched
  rotr $14 $14 8;
  llo $7 update;       # patched
  or $14 $14 $7;
  jr $15 0($14);
  st $1 0($3);
  jr $0 1($0);
value: .data 3;

21.2.6 (Brainstorm) Fixing a llo instruction Suppose that the main and update modules were laid out in memory in the opposite order, as shown in the diagram below.

main:
  lhi $3 value
  rotr $3 $3 8
  llo $7 value
  or $3 $3 $7
  ld $2 0($3)
  lhi $14 update
  rotr $14 $14 8
  llo $7 update
  or $14 $14 $7
  jr $15 0($14)
  st $1 0($3)
  jr $0 1($0)
value: .data 3
jmp main
update:
  bneg $2 decrement
increment:
  addi $1 $2 1
  jr $0 2($15)
decrement:
  addi $1 $2 -1
  jr $0 2($15)

One of the instructions modified by the linker is the machine instruction corresponding to llo $7 value. What change does the linker make to this instruction? List the exact contents of the new instruction, and briefly explain how you got it. 21.2.7 (Brainstorm) Relocate jr? Why do jmp instructions get relocated but not jr instructions? 21.2.8 (Brainstorm) Relocate branches? Why are bz and bneg instructions not relocated? 21.3 Linking on the MIPS(7 steps) 21.3.1 (Display page) Linking Linking

The MIPS linker takes one or more object files as input. An object file contains several sections, including the following:

the .text section, where the assembler has placed the program's machine instructions;
the .data segment, a binary representation of the data in the source program;
the symbol table, which contains information about names defined in this module and accessible in other modules;
the relocation table, which contains information to be used by the linker to fix up instructions that require absolute addresses in the .text or .data segment.

The last two were combined in the .syms files in project 1; also, there is no separate .data segment in the CAL16. The job of the linker is to arrange the various object modules in sequence—that is, to relocate them—then to turn all the guesses about addresses made by the assembler into actual addresses. (The linker was once called a "linkage editor" because it involves editing the "links" in jump-and-link instructions.) The diagram below illustrates the role of the linker.

Here are the detailed steps.

1. The assembler has assumed that the first instruction in each .text segment starts at address 0. The linker takes the .text segments from the .o files and concatenates them, keeping track of the starting address of each individual segment.
2. The linker does the same for the .data segments, concatenating them into a single .data segment and keeping track of the starting address of each individual segment.
3. The linker then goes through all the individual relocation tables, which contain addresses of instructions or data values to fix up, and does the fixing.

21.3.2 (Display page) Things to fix up Things to fix up

Addresses may appear in memory in a variety of contexts. For instance, a jump table in the .data segment may consist of addresses of jump destinations, as in the example below for simulating the dice game of Craps:

jumpTable:
    .word lose      # 2
    .word lose      # 3
    .word trypoint  # 4
    .word trypoint  # 5
    .word trypoint  # 6
    .word win       # 7
    .word trypoint  # 8
    .word trypoint  # 9
    .word trypoint  # 10
    .word win       # 11
    .word lose      # 12

The linker would have to replace all those table entries by the absolute addresses of the lose, win, and trypoint functions. More often, however, the fixes are to instructions in the .text segment. These result from jumps and address loads. The last 26 bits of every j and jal instruction must be replaced by the corresponding bits of the destination absolute address. Similarly, the instructions that comprise a la, usually a lui and an ori, must be edited; the immediate fields in each are updated to access the correct absolute address. 21.3.3 (Display page) Dynamic linking Dynamic linking

What we've described is the traditional process, a "static" linking approach. If library routines or data are used, a copy of the entire library becomes part of the executable file, even if only part of it is needed. Moreover, updates to the library don't get propagated to programs that use it. On the other hand, the executable file is self-contained. An alternative is dynamically linked libraries (DLLs), common on Windows and UNIX platforms. These are accessed at run time rather than at linking time. Wikipedia contains a good description of dynamic linking. Dynamically linked libraries add significant complexity to the compiler and operating system. However, they provide a number of benefits:

1. An executable file takes up less space on the disk.
2. Copying a program to another computer takes less time.
3. If two programs share the same library, they collectively need less memory to run.
4. Replacing one file in a library upgrades every program that uses that library.

21.3.4 (Brainstorm) Disadvantages of dynamic linking? List one or two disadvantages of dynamic linking. 21.3.5 (Display page) Loading Loading

The loader, generally a component of the operating system, loads an executable file into memory and starts it running. Here's what's involved:

It reads the executable file's header to determine the size of its text and data segments.
It creates a new address space for the program, large enough to hold the text and data segments and a stack segment.
It copies instructions and data from the executable file into the new address space.
It copies the arguments passed to the program onto the stack.
It initializes the machine registers. Most get set to 0; the stack pointer is assigned the address of the current top of the stack.
It jumps to a startup routine that copies the program's arguments from the stack to registers and calls main. When main returns, the startup routine terminates the program using the exit system call.

21.3.6 (Display page) An example An example

The Weiner Lecture Archives site contains an excerpt from a previous CS 61C lecture discussing an example of the compiling/assembly/linking/loading process. To access it, follow the link to wla.berkeley.edu, go to CS 61C, click on the triangle by "Programs", then on the triangle by "Bringing Code to Execution". Finally, click on "Example". This might not work in the lab, depending on what plugins have been installed for Firefox. If it does work in the lab, please use headphones. 21.3.7 (Display page) Some things you should wonder about Some things you should wonder about

Reproduced below is a diagram showing the memory layout of a program. According to the diagram, the program starts at address 0x00400000, the stack starts at 0x7FFFFFFF, and the heap and static

data are somewhere in between.

We said previously that the linker starts the .text segment at 0. What's going on with 0x00400000? Doesn't that require patching the executable again?
The range of addresses 0x00400000 to 0x7FFFFFFF spans more than a gigabyte of memory. What if that much memory is not installed on the computer?
If all programs start at 0x00400000, how are multiple programs running at the same time managed?

All of these questions will be answered when we cover virtual memory in a few weeks. 21.4 Compare translation and interpretation.(4 steps) 21.4.1 (Display page) Interpretation vs. translation Interpretation vs. translation

Interpretation has a number of advantages over translation.

Interpreters are usually easier to write than compilers, since an interpreter doesn't have to worry about the details of a compiler's target language.
Since an interpreter deals with a source language program at a high level, it can provide a more informative error message when some aspect of the interpreted program is incorrect. Informative messages are more difficult to produce at the machine level. (For example, perhaps you have seen the message "Segmentation fault"?) Most compilers have options to produce more information about the translated program that, say, a debugger might use to be more helpful.
Interpreters are platform-independent. An interpreter written in (say) Java can be run on any computer providing a Java interpreter. A compiler necessarily must know the details of the target language; retargeting the compiler to produce code for another machine will require substantial revision.
A program written in a high-level language is likely to be much smaller than its machine language counterpart. (Contrast the size of a C program with the size of the corresponding a.out file.) Code size becomes an issue when a program is run over a network, where one wishes to minimize the amount of data transmitted.

A big disadvantage of interpreted code versus compiled code is that it's probably much slower. Many applications require the speed of compiled code. (Note, however, that speed is less important for early program prototypes. Numerous applications have been first written in a language like Lisp in interpreted mode, then translated to a lower-level language—yes, there are Lisp compilers—for final implementation. An example is described here.) 21.4.2 (Display page) Case studies Case studies

Throughout computing history, there have been examples of the use of interpreters to simplify changes in computer architecture. Some examples:

IBM moved from the 7000 series of computers to the architecturally quite different System/360 series in the early 1960's. The IBM 7000 was quite popular. Seeking to minimize trauma for its users, IBM provided in the System/360 an emulator that interpreted 7000 instructions. Programs written for the 7000 didn't have to be rewritten or even recompiled; they could be run directly in "7000 mode".
The Apple Macintosh has been through two dramatic changes in CPU architecture. First was the move from the Motorola 680x0 chip to the IBM/Motorola PowerPC in the early 1990's; more recently came the switch to the Intel x86 architecture. Rather than require that all old software be recompiled or, worse, rewritten to work on the new architecture, Apple provided emulators that detected old code and allowed it to be run unmodified.
Suppose you fondly remember playing fun computer games on an old computer, say, the Apple II or Commodore PET. Are these games lost forever? Not if you have an emulator for the relevant computer architecture. A Google search for "computer emulation" produces lots of information about emulators for out-of-production computers.

21.4.3 (Brainstorm) Java design Java was originally designed to be an embedded language in consumer electronics. Which of the following advantages of interpretation do you think was most important for the Java design?

1. An interpreter is easier to write than a compiler that produces machine language as output.
2. An interpreter can produce better error messages than compiled code can.
3. Program code intended to be interpreted is smaller in size than its machine language equivalent.
4. An interpreter can be run on a variety of computers; a compiler producing machine code would need to be rewritten for each computer.

Briefly explain your answer. 21.4.4 (Display page) Implementing a language processor Implementing a language processor

Suppose we have designed a new language named SMIP, and we would like to implement a compiler and an interpreter for it on the Lemon computer. This is not too hard if there are high-level languages available on the Lemon with which to write a compiler or interpreter for SMIP. If there is no suitable language on the Lemon for writing a language processor, but there is a suitable language on the Grape computer and files can be transferred between the two, we can write a cross compiler on the Grape that translates SMIP to Lemon machine code. But what if we only have an assembler on the Lemon? We can start by writing, in assembly language, an interpreter for a strategically chosen subset of SMIP. We choose to write an interpreter because it's easier to implement than a compiler. We choose to interpret a subset because we eventually intend to write a SMIP language processor in SMIP itself. We now can interpret programs written in a subset of SMIP. In particular, with a suitable subset, we can write a SMIP-to-Lemon machine language translator in that subset. Then we can provide the translator to itself as input, and get a compiler for SMIP that runs natively on the Lemon. Neat, yes? Here are the steps in diagram form.

[Diagrams, one per step: the general situation; a special program to be assembled; a special program to be run; normal input for that program; special input for that special program; the output; the last two steps.]

21.5 Homework (not complete)(2 steps) 21.5.1 (Display page) Homework Homework

By the start of your next lab, post a question about the project 2 specification in the following discussion step. The question should be one that either you have or you suspect that one of your classmates may have. (It shouldn't match any of the previously posted questions.) Then, in lab on Tuesday or Wednesday, post an answer to one of the questions. More homework, a short set of exercises, will appear over the weekend. 21.5.2 (Discussion Forum) Project 2 Q/A


By the start of your lab on Tuesday/Wednesday, post a question about project 2. In that lab, answer one of the questions posted by your labmates. 22 Mid-semester survey 2009-3-10 ~ 2009-3-11 (1 activity) 22.1 Survey questions(2 steps) 22.1.1 (Display page) Some background about filling out the survey questions Some background about filling out the survey questions

Some of the questions on the survey in the next step involve assigning numeric ratings to a list of items we provide. We would like you to input the ratings one per line, without extraneous text, in the same order as the items you're rating. Here's a sample question of this type.

Rate each of the following on a scale from 4 (highly valuable) to 1 (useless) according to how valuable they have been for improving your understanding of CS 61CL topics:

a. lab exercises b. homework exercises c. "brainstorm" questions d. face-to-face collaboration with other students in lab

If you have no opinion about one of the activities above, rate it as 0.

Suppose that you thought lab exercises were really valuable, homework wasn't very useful, brainstorms were somewhat valuable, and interacting with your labmates was useless. You would then type the following into the text box:

4
2
3
1

We chose this more compressed question format (instead of a separate question for each activity) in order to make the survey easier for you to take. 23 Circuits 2009-3-10 ~ 2009-3-11 (3 activities) 23.2 Work with circuits, the basic building blocks of the computer.(16 steps) 23.2.1 (Display page) Background Background

We are about to descend to the circuit level, in preparation for examining the central processing unit and other computer components. Integrated circuits are laid out on a bare die that's primarily crystalline silicon. The width and length of the die range from 1 to 25 millimeters. There are roughly 100 million to a billion transistors on the die, each around 65 nanometers (65 × 10⁻⁹ meters) in size. They are organized into logic gates, roughly four transistors per gate; the die thus contains roughly 25 to 250 million gates. There are several conductive layers in which signals travel from place to place on the die. A chip contains a die, mounted on a ceramic or plastic base. The connections on the die are much too close together to be accessible from outside the die, so the ceramic base provides a way of spreading them out. It also allows heat to be dissipated. (Power consumption and the resulting generation of heat are the big constraints on performance.) A couple of examples of real dies appear below. One is a relatively old (late 1990s) PowerPC; the other is a

more recent Pentium. Labels on the photographs identify functional units on the dies:

the instruction fetcher and sequencer; the handler of loads and stores; the arithmetic and logical unit; the floating-point coprocessor; the caches for instructions and data.

In two or three weeks, we will see how all these units are organized. For example, the MIPS CPU is diagrammed below.


Its building blocks—adders, multipliers, shifters, and selectors—are all made of the logic gates that we'll shortly be studying. 23.2.2 (Display page) Representations of combinational logic circuits Representations of combinational logic circuits

Any circuit has one or more inputs and one or more outputs. In a combinational logic circuit, the output values are based entirely on the input values. A state element, on the other hand, has memory; its outputs are based not only on the current input values but on earlier inputs as well. We'll study state elements in an upcoming lab. Here we focus on combinational circuits. They are like mathematical functions; the output depends solely on the input, and there is a beautiful body of theory and practice behind them. The digital abstraction allows us to treat voltages above a certain threshold as representing the logical value 1 or TRUE, and voltages below it as representing 0 or FALSE. In electrical engineering courses, one learns how to construct circuits that take these digital values as inputs and produce digital values as outputs. In CS 61CL, we work with the digital abstraction—and it is beautiful. A circuit is built of gates and wires. The typical gates that you'll recognize from logic are "and", "or", and "not"; outputs from gates can be connected by wires as inputs to other gates to construct more complex functions from simpler ones. We will shortly see how to construct more complicated circuits out of these three primitives. Right now, we will look at three ways of representing a circuit.

For each pattern of 1's and 0's in the inputs to a combinational logic circuit, there is a unique output value. A truth table lists all the possible input values along with the corresponding output values. There is a column in the table for each input and for each output. There is a row in the table for each pattern of 1's and 0's in the input. Listed below as examples are the truth tables for "and", "or", and "not"; the two inputs are designated by a and b ("not" takes the single input a).

a b   a and b
0 0   0
0 1   0
1 0   0
1 1   1

a b   a or b
0 0   0
0 1   1
1 0   1
1 1   1

a   not a
0   1
1   0

A circuit can also be represented graphically with a symbol. The conventional graphical representations for "and", "or", and "not" are shown below. Graphical representations are convenient for conveying the structure of a circuit. These logic gate symbols are standardized and used internationally. We'll show you the others later. The little circle on the end of a wire indicates negation.

Finally, a circuit can be represented algebraically using Boolean algebra. Boolean variables, as you know from programming, can take on two values, true and false. We will use the operator "⋅" for "and", "+" for "or", and "!" for "not". (The readings use an overbar for "not", but we are unaware of any way to apply an overbar to a character or expression in HTML.) An example expression would then be

a ⋅ (!b + c) which is read as "a and the quantity not b or c".

23.2.3 (Display page) Inventing xor Inventing xor

The "exclusive or" function ("xor") is a very useful digital logic gate. You may have already used the xor operator, '^', in C. We'll construct it out of the gates we have seen so far and illustrate all three representations. The "xor" of two Boolean values is 1 if one of the values is 0 and the other is 1, and 0 otherwise. This informal description is made precise in the truth table.

a b   a xor b
0 0   0
0 1   1
1 0   1
1 1   0

An algebraic expression defining the xor operator is

x = a ⋅ !b + !a ⋅ b

Finally, this expression can be represented with a circuit of gates and wires as follows. This graphical representation is called a schematic.

Note that each of the inputs a and b is wired as input to two gates. This "forking" of input values is quite

common. xor is so important that it has its own logical symbol. 23.2.4 (Brainstorm) nand Using the logic symbols that you have so far, construct the Boolean function nand that's described by

the truth table below. Its logic symbol is the AND symbol with a complementing circle on the output.

a b   a nand b
0 0   1
0 1   1
1 0   1
1 1   0

Also, give a simple English description that relates nand's inputs to its output. 23.2.5 (Display page) Introduction to Logisim Introduction to Logisim

The Logisim program provides a way to build, visualize, and simulate circuits. (The technical term for this kind of tool is "schematic capture".) In the old days we had little plastic templates and would trace the outlines of logic symbols on paper and draw lines to wire them up, just like drafting the design of a house. Computer Aided Design (CAD) tools, some invented at Berkeley, allowed you to "capture" these designs in a machine-readable form. In CS 150 and CS 152 we use professional tools for this. However, they have a huge learning curve and lots of company-specific features. Logisim is a very basic tool, but it is quick to learn, easy to use, and just enough for our needs. Run Logisim by ssh'ing to cory.eecs as you did to use MARS. Then type

java -jar ~cs61c/bin/logisim.jar

or merely

logisim

at the UNIX prompt. You can learn about Logisim features beyond the scope of this lab (or download it to use at home) from the Logisim website. Start by creating a very simple circuit just to get the feel for placing gates and wires.

1. Click the AND gate button in the toolbar above the layout area. This will cause the shadow of an AND gate to follow your cursor around. Click once in the main schematic window to place an AND gate. In the little description in the lower left portion of the window, adjust the number of inputs to 2. Notice the orientation is "Facing East". With more complex designs you will need to take care in how you place things on the page. So far it looks like


2. Click the "Input Pin" button. Now, place two input pins somewhere to the left of your and gate so they line up with the little dots on the gate.

3. Click the "Output Pin" button. Then place an output pin to the right of your and gate. Your schematic should look something like this at this point:

4. Click the "Wire" tool button. Click and drag to connect the input pins to the little dots on the left side of the and gate. This may take several steps, as you can only draw vertical and horizontal wires. Just draw a wire horizontally, release the mouse button, then click and drag down starting from the end of the wire to continue vertically. You can attach the wire to any pin on the and gate on the left side. Repeat the same procedure to connect the output (right side) of the and gate to the output pin. After completing these steps your schematic should look roughly like this:


Here we have also used File->Save As to save our design. Note that the circuit is already being simulated. The "unknown" value on the output pin has been replaced by a zero value. 0 and 0 = 0. How about that?

5. Click the "poke" tool and try clicking on the input pins in your schematic. Observe what happens, verifying that this circuit does what you think an and gate should do. Compare this to the truth table for and. 6. Good drawings are good documents. Use the "text" tool to add a legend in the corner of your schematic. 7. Expand the Base folder on the lefthand tool panel. Click on the "Label" tool. In the text box, enter the label name. Then click on the wire that you want to associate that label with. You can use the "select" tool to move things around. Now it looks something like this.

Notice how the signals that are 1 light up in the schematic. 23.2.6 (Display page) Implementing XOR as a subcircuit Implementing xor as a subcircuit

In C and Java programs we tackle hard problems by "dividing and conquering". We build helper functions to solve subproblems and then put them together to solve the whole problem. Good design means managing complexity. The same is true for logic design. A schematic can contain subcircuits. Here, we will create a subcircuit to demonstrate its use. Start by creating a new schematic for your work (File->New), then a new subcircuit (Project->Add Circuit). You will be prompted for a name for the subcircuit; call it "xor". In the new schematic window that you see, create a simple xor circuit with two input pins on the left side and an output pin on the right side. Go back to your main schematic by double-clicking main in the circuit selector at the left of the screen. Your original (blank) schematic will now be displayed, but your xor circuit has been stored. Now, single-click the word xor in the list. This will tell Logisim that you wish to add your xor circuit into your main circuit. Try placing your xor circuit into the main schematic. If you did it correctly, you should see a gate with two input pins on the left and one output pin on the right. Try connecting input and output pins to these and see if it works as you expect. It should look something like this.


You can poke at the inputs inside the subcircuit to try it out. Notice the magnifying glass shows you which circuit you are focused on. Pay particular attention to how wires cross. The dot at a T or + indicates that the two wires are connected. The absence of a dot indicates that they are not. A little care in laying out your circuit on the page will save you a ton of time. Now let's instantiate this subcircuit in our main schematic. Double-click on main to make it the focus. Now add an xor subcircuit by clicking and dropping, just like you did a gate. Notice the two dots on the left side. That's where we put the input pins. The one on the right is the output pin. You need to be extremely careful in Logisim with the placement of pins in your subcircuits. Hook up inputs and output so that you can try out your new hierarchical circuit. Now you can poke the inputs and see that this is indeed an xor. It should look something like this.

In debugging you need to be able to work up and down the abstractions that you build. Here, the abstractions are subcircuits. Ctrl-clicking on the xor symbol should pop up a "view xor" option. Selecting that takes you into the xor circuit with the input values being the ones that are present in the main circuit. In our case, it looks like this.


One thing about hardware—it is inherently parallel. Everything in the circuit is happening at once. Subcircuits are not "invoked" like functions and procedures in a programming language—they exist in hardware. In simulation, they behave like the actual hardware—it all exists. In fact, let's go back to the main circuit and drop in another xor and another input, so it looks like this.

This is an "odd parity" circuit. Its output is 1 when the number of inputs with value 1 is odd. Try poking its inputs. View the underlying subcircuits. See how they are different? You can use the Simulate pull-down menu to navigate up and down. "Give me a lever and I'll lift the world." Now that we can build big circuits by composing smaller ones, we can build anything. (Well, almost. We still only have combinational elements. We'll need some state too.) 23.2.7 (Brainstorm) 1-bit adder Consider a 1-bit adder that produces a 2-bit result: 0 + 0 = 00, 0 + 1 = 1 + 0 = 01, and 1 + 1 = 10. Let a and b be the bits to be added, with s1 being the leftmost (most signficant) bit of the sum, and s0 the rightmost bit of the sum.

1. Construct the truth table. Hint: It should have four columns and four rows.
2. Identify an expression for s1.
3. Do the same for s0, the least significant bit of the sum.

23.2.8 (Display page) More Logisim: a "half adder" subcircuit More Logisim: a "half adder" subcircuit

Create a new project. The circuit described in the preceding step has a name; it is called a "half adder". (We'll see a "full adder" later.) Add a subcircuit called HA that implements the 1-bit adder in Logisim. You may use your xor subcircuit. Also implement the following function that

would be part of a "mod 4" counter. The combinational logic circuit within the counter would take a 2-bit value as input and produce a 2-bit output representing the next value in the sequence 00, 01, 10, 11, 00, 01, 10, 11, 00, ... Use the xor gate in your solution. Start by writing down the truth table. Express each output as a function of the two inputs. 23.2.9 (Display page) Truth table to Boolean expression Truth table to Boolean expression

Starting from a truth table, we can write a special Boolean expression that matches its behavior, namely the sum-of-products canonical form. A "canonical form" is an important concept in CS that shows up wherever you have many ways of representing the same thing. You probably encountered it first in grade school, where you "simplified fractions" by dividing out all common divisors in the numerator and denominator. Of course, each rational number has an infinite number of ways it can be represented as a fraction, but this one is unique. You see the same thing in writing polynomials in standard form with exponents decreasing to the right. You saw it with normalized floating-point numbers. The sum-of-products (SOP) expression is the or of a set of terms, each term being a product (and) of input variables or their complements. Every row where 1 appears in the output contributes a term to the Boolean expression. Each and term includes all the input variables, in either complemented or uncomplemented form. A variable appears in complemented form if it is a 0 in that row of the truth table, and uncomplemented if it appears as a 1. Here's an example of producing the sum-of-products form for a three-input function.

min term      a b c   y
              0 0 0   0
!a ⋅ !b ⋅ c   0 0 1   1
!a ⋅ b ⋅ !c   0 1 0   1
              0 1 1   0
a ⋅ !b ⋅ !c   1 0 0   1
              1 0 1   0
              1 1 0   0
a ⋅ b ⋅ c     1 1 1   1

The sum-of-products expression is

y = !a ⋅ !b ⋅ c + !a ⋅ b ⋅ !c + a ⋅ !b ⋅ !c + a ⋅ b ⋅ c

This sum-of-products can be implemented directly in gates, or it can be simplified further. (By the way, does this look familiar? Do you see any use for it?) 23.2.10 (Display page) Full adder Full adder

The conversion between truth tables and circuit diagrams is algorithmic, so we can write a program to do it; this is the beginning of Computer Aided Design. There are lots of such design tools, but let's use the simple version built into Logisim on a useful example. A full adder (FA) is a circuit that takes three inputs, a and b, as well as a carry in, cin. It implements (in binary) the addition step that you learned as a child: sum the carry-in (that you wrote at the top of the column) and the two input values to produce a sum (in the same column) and a carry out (that you wrote on top of the next column). The FA just does one column (called a bit slice). It generates two outputs: a sum bit, s, as well as a carry out. All of this is more precisely described by the truth table. (Note that this is really two truth tables, one for cout and one for s that are built on the same inputs.)

a b cin   cout s
0 0 0     0    0
0 0 1     0    1
0 1 0     0    1
0 1 1     1    0
1 0 0     0    1
1 0 1     1    0
1 1 0     1    0
1 1 1     1    1

There is a carry out if there are two or more 1's in the input. The sum is 1 if the number of 1's in the input is odd. In Logisim, start a new project called adder. Add a subcircuit called FA. In that subcircuit, drop three input pins and two output pins. Pull down Project->Analyze Circuit. Click on inputs; you will see that it has named them a, b, c. Click on outputs; they are x and y. Now click on Table. By clicking the entries in the result column, create a truth table where x is the carry out and y is the sum. It should look like this.


Click on the Expression tab. See that the logical expression for x says formally "are two of the inputs true?" and the logical expression for y says "is the number of true inputs odd?". Now click on the Build Circuit button. It asks you for a circuit name. We could stick this in a new subcircuit, but we'll let it go in FA. Don't check either of the options; just click OK, then YES in the verification box. You can close this tool. Go back to the FA circuit. It now contains two beautiful sum-of-products examples. Verify that they correspond to the logical expressions. 23.2.11 (Self Test) Sum of products For a Boolean sum-of-products circuit with n inputs and one output, what is the maximum number of inputs that an and gate may have to accept?

What is the maximum number of inputs that the or gate may have to accept?

If only three of the entries in the result column are 1, how many inputs will the or gate have?

If only three of the entries in the result column are 1, how many and gates will there be?

23.2.12 (Display page) Composing circuits into larger ones: a 4-bit adder Composing circuits into larger circuits: a 4-bit adder.

Go back to main and drop an FA down. See where the pins are? It is good to label your components, but Logisim is a bit weak on this. Grab the text-editing tool (A) and click in the middle of the FA symbol. Type "FA". Add input pins above and an output pin below, and two short wires to the cin and cout without any pins. It should look something like the following. Now you can select the entire region and copy it.

Now you can copy, paste, move and build a larger adder out of FAs. (We find that it works best to copy, click in a blank space to drop the selection, paste, then drag.) Connect an input pin to the carry-in on the right and an output pin on the carry-out. It should look something like this.


Poke at the inputs. It adds from left to right, just like you do, but in binary. It is called a ripple-carry adder because the carry ripples from left to right. See the power of modular design? Imagine if you had all the gates that are inside the FA appearing in this schematic. In fact, let's go count how many there would be. 23.2.13 (Self Test) Adder gate count How many logic gates does your 4-bit adder contain?

23.2.14 (Display page) Adder truth table size Adder truth table size

If you represented the entire 4-bit adder as a truth table, how many rows and columns would it have? To check your answer, focus on the main circuit and do Project->Analyze Circuit. The mapping between combinational logic, truth tables and Boolean expressions goes in any direction. 23.2.15 (Display page) Propagation Propagation

The "critical path" of this circuit is the carry propagation. See if you can set the inputs so that changing an input to the rightmost FA changes all the carries and the sums. 23.2.16 (Display page) Basic combinational circuit design Basic combinational circuit design

Congratulations! Now you can design a combinational circuit for any Boolean function, no matter how complex. You can define higher-level building blocks, and you can compose them. From here, it is all a matter of optimization and improvement. 23.3 Readings and homework(5 steps) 23.3.1 (Display page) Readings Readings

This week's material is covered in P&H C.1-3 (fourth edition) and B.1-3 (third edition). 23.3.2 (Display page) Homework, part 1 Homework, part 1

During lab on March 12 or 13, show progress on project 2. Up to 4 homework points will be awarded for this milestone: 1 point for each of the following cases (translated to assembly language) for which you can produce the correct output, up to a maximum of 4.

int k;
char outstring[20];

k = snprintf (outstring, 20, "%d", 5);
k = snprintf (outstring, 20, "%u", 7);
k = snprintf (outstring, 20, "%x", 9);
k = snprintf (outstring, 20, "%c", 'x');
k = snprintf (outstring, 20, "%s", "abc");
k = snprintf (outstring, 20, "%%");

23.3.3 (Display page) Homework, part 2 Homework, part 2

Submit a solution to this exercise to Expertiza under the name "linker". A submission is due by 11:59pm on Monday, March 16. Here is some C code:

int array1[10];

void funct (void) {
    int array2[10];

    array2[3] = array1[3];
}

(The procedure doesn't do anything very interesting.)

a. Give as specific a description as possible of the code that the C compiler will generate to load register $t0 with array1[3]. (Actual instructions are more specific than pseudocode.)
b. Give as specific a description as possible of the code that the C compiler will generate to store register $t0 into array2[3].
c. For each of the code segments in parts a and b, we want to know the earliest stage in the compilation process at which the final numeric values of the fields in all the relevant instructions are known. Are the numeric fields in the instructions that load $t0 with array1[3] first known ... by the C compiler, by the assembler, or by the linker? Briefly explain.
d. Are the numeric fields in the instructions that store $t0 into array2[3] first known ... by the C compiler, by the assembler, or by the linker? Briefly explain.

23.3.4 (Display page) Homework, part 3 Homework, part 3

Submit answers to these three exercises to Expertiza in a single text file under the name of "circuits". A submission is due by 11:59pm on Monday, March 16.

Combinational circuits implement Boolean functions. There are four Boolean functions of one variable. They are shown in the following table.

i   False(i)   Identity(i)   Not(i)   True(i)
0   0          0             1        1
1   0          1             0        1

How many Boolean functions of two variables are there? Make the table. Give names to the functions that you recognize. (Hint: you've seen AND, OR, NOT, XOR so far.) In lab we have worked with AND, OR, and NOT. This set of Boolean operators is complete in that any Boolean function can be implemented with a combination of these operators. In fact, there are some very useful operators that are complete all by themselves. One of them is

NAND. Its symbol is the AND symbol with a complementing circle on the output, and its truth table is

a b   a nand b
0 0   1
0 1   1
1 0   1
1 1   0

Show how to build AND, OR, and NOT using only NAND. This constitutes a proof that NAND is complete. A multiplexor (often referred to as a "mux") is an extremely useful circuit. The MIPS CPU has several multiplexors, as we will see in a couple of weeks. A multiplexor takes three inputs, a, b, and s (a selector). The output is a if s is 0 and b if s is 1. It's diagrammed below.

Provide the truth table of a multiplexor, and give a Boolean expression that relates its output to its inputs.

23.3.5 (Display page) Homework, part 4 Homework, part 4

By 11:59pm on Monday, March 16, provide a peer evaluation of classmate homework 6 submissions designated by Expertiza. Also complete the midsemester survey in an accompanying activity. 24 Combinational logic 2009-3-12 ~ 2009-3-13 (4 activities) 24.1 Combinational circuit optimization(8 steps) 24.1.1 (Display page) Combinational circuit optimization Combinational circuit optimization

In the previous lab you saw the deep relationship between truth tables, Boolean expressions, and combinational circuits. The truth table specifies the mapping from inputs to outputs, and it is unique (up to permutations of the rows). For any truth table, there is an infinite set of combinational circuits that implement the mapping, and each combinational circuit corresponds to an equivalent Boolean expression. However, there are canonical circuits that are 1-1 with the truth table, such as the sum-of-products form. Often there are other circuits that have identical functional behavior but are better by some metric, for example, smaller, faster, cheaper, or more robust. An elegant body of theory allows us to automate combinational circuit optimization. This is the essence of Computer Aided Design

tools, many of which were pioneered here at Berkeley. Upper-division courses, such as EECS 150, go into that in detail. In CS 61CL we'll get an appreciation for it and use some of those tools while we put together useful designs. 24.1.2 (Display page) Boolean algebra Boolean algebra

The system of Boolean algebra (invented by the mathematician George Boole in the 19th century) allows us to manipulate Boolean expressions to make the circuits that implement them faster, smaller, or cheaper. (Remember, in logic • is AND and + is OR. In C, the bitwise versions of these are & and |.) Listed below are useful laws of Boolean algebra.

x • !x = 0                    x + !x = 1                    complementarity
x • 0 = 0                     x + 1 = 1                     laws of 0's and 1's
x • 1 = x                     x + 0 = x                     identities
x • x = x                     x + x = x                     idempotent law
x • y = y • x                 x + y = y + x                 commutativity
(x • y) • z = x • (y • z)     (x + y) + z = x + (y + z)     associativity
x • (y + z) = x•y + x•z       x + y•z = (x + y) • (x + z)   distribution
x•y + x = x                   (x + y) • x = x               uniting theorem
!(x•y) = !x + !y              !(x + y) = !x • !y            DeMorgan's law

Notice that all these laws come in two versions, one being called the dual of the other. Practically speaking, this means that if you know one version of the law, then you can easily derive the other one. Here is an example of using these laws to simplify the expression y = a•b + a + c.

y = a•b + a + c
  = a•(b + 1) + c    distribution, identity
  = a•1 + c          law of 1's
  = a + c            identity

24.1.3 (Brainstorm) Another XOR An alternative expression for a xor b is (a + b)•(!a + !b). Show that this expression is equivalent to the one previously presented, a xor b = a•!b + !a•b. 24.1.4 (Brainstorm) A useful derived law Using the laws of Boolean algebra, show that a + !a•x = a + x. (This means we can eliminate two of the three gates without changing the behavior of the circuit!) 24.1.5 (Brainstorm) Simplification practice Supply the steps that simplify

y = !a • !b • !c + !a • !b • c + !a • b • c + a • !b • !c +a • b • !c to

!a•!b + !a•c + a•!c 24.1.6 (Display page) Using Boolean algebra for optimization Using Boolean algebra for optimization

The laws of Boolean algebra are used to optimize designs while preserving correctness in much the same way that compilers optimize the machine code produced in the translation from a high level programming language. The logic designer can work at a higher level and produce a design that is easy to understand and to verify. The tools seek to eliminate redundancies, exploit special cases, or transform the circuit to improve an important metric, say delay, at the expense of a less important one, say number of gates. These transformations can be done algebraically or graphically. Many of them seem obvious, but when you have hundreds of millions of gates, having tools that perform these zillions of little optimizations for you, without making any mistakes, is a huge benefit. Here are some simple examples that eliminate unnecessary gates.


Indeed, all of these are peephole optimizations used in compilers as well, but in that setting they are used to optimize predicates, loops, and branches. Here we are eliminating gates and wires. Some optimizations improve the performance of a circuit but don't necessarily reduce its gate count. Some improve both. Here are some more.

((X & Y) & A) & B => (X & Y) & (A & B)
(X & A) | (Y & A) => (X | Y) & A

24.1.7 (Brainstorm) Invent an optimization Propose another optimization rule that converts a Boolean expression to an equivalent version that uses fewer operators. 24.1.8 (Display page) Simplifying a sum of products Simplifying a sum of products

In practice, one of the most important rules is the following.

(x & A) | (~x & A) => A Suppose you have a sum of products whose truth table contains two ON rows that differ by complementing one input. The rule above allows you to take the AND gates that come from the two ON rows and smash them together. It eliminates two AND gates and one input to the OR gate. In EECS 150 you learn a clever trick for this called "Karnaugh maps". In CAD courses you learn even more powerful algorithms for this. 24.2 More logic practice(11 steps) 24.2.1 (Display page) The majority function The majority function

Given below is the truth table for the majority function. It takes three inputs, and returns whatever value two or more of the inputs share.

a b c   m
0 0 0   0
0 0 1   0
0 1 0   0
0 1 1   1
1 0 0   0
1 0 1   1
1 1 0   1
1 1 1   1

Devise an implementation with as few gates as possible (our version has three AND's and two OR's), and implement your solution in Logisim. 24.2.2 (Display page) Converting gate types Converting gate types

A simple example of automatic circuit transformation is conversion between gate types. For example, the circuit family you are working with may have only two-input gates, as these may be much faster than wider gates. The associative law allows us to form a tree of AND gates to make a wide AND (or a tree of OR gates to make a wide OR). In fact, there are lots of trees that work, and they have different performance but the same logical behavior. Let's let the tools do it for us. Copy ~cs61cl/code/adder.circ to a local directory and open it up with Logisim. Double-click FA to focus on the full-adder subcircuit. Do Project->Analyze Circuit. (You will see that the inputs and truth table and expressions all reflect the input and output labels.) This time when you Build Circuit, make the name FA2 and click the Use Two-Input Gates Only box. Take a look at the new FA2 circuit.


What has happened to the OR gate? What has happened to the 3-input AND gates? How many gates are used in total? How many gates are used along the longest path from an input to an output?

The behavior of FA and FA2 is identical. Poke it and try it out.

24.2.3 (Display page) Working with subcircuits

In fact, you could replace FA's with FA2's in main. But one thing you need to be careful about is the layout of pins on the subcircuit symbol. Try dropping an FA2 in main somewhere. It has three pins on the left (the inputs) and two on the right (the outputs). You need to be extremely careful if you change the pins in a subcircuit in any way. (In case we didn't say it strongly enough: BEWARE. Modularity through subcircuits is incredibly powerful, but you must establish the interface to the module and keep it.) Rearrange the pins on your FA2 module so that they conform to the orientation in FA. Note that when you mouse over the pins on the symbol, you can see their names. When you've got it right, rip out an FA and replace it with an FA2. Poke at it to verify that it is still a 4-bit adder.

24.2.4 (Display page) NAND gates and De Morgan's theorem

So far we have been using familiar logic operations: AND, OR, and NOT. NAND seems kind of funny. It is like an AND followed by a NOT. Its symbol is the AND with a complementing circle on the output. In Logisim, click on the Gates entry in the explorer panel on the left to see the NAND. Its truth table is

A B | nand(A,B)
0 0 | 1
0 1 | 1
1 0 | 1
1 1 | 0

In this week's homework, you see that this funny gate is actually more powerful than the three more familiar ones: it is complete all by itself. Any logic function can be constructed using only NANDs. Perhaps even more surprising is that it is a lot faster and smaller than an AND gate or an OR gate. It seems that transistors just really like this idea of complementing the output. NOR is complete, fast, and small as well. Well, if that hasn't got you excited about NAND, there's more. You may have run into De Morgan's law. It has two forms.

A + B = ~(~A • ~B)
A • B = ~(~A + ~B)

To prove the first one, construct the truth table for each side and check that the two columns are the same.
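You can also check both forms mechanically, and at the same time see NAND's completeness in action by rebuilding the three familiar gates from NAND alone. An illustrative Python sketch (here 1 - x plays the role of ~x on single bits):

```python
# De Morgan's laws checked over all four input combinations,
# plus NOT, AND, and OR rebuilt from NAND alone.
def nand(a, b):
    return 1 - (a & b)

def not_(a):
    return nand(a, a)

def and_(a, b):
    return nand(nand(a, b), nand(a, b))   # NAND then NOT

def or_(a, b):
    return nand(nand(a, a), nand(b, b))   # De Morgan in action

for a in (0, 1):
    for b in (0, 1):
        assert (a | b) == 1 - ((1 - a) & (1 - b))   # A + B = ~(~A . ~B)
        assert (a & b) == 1 - ((1 - a) | (1 - b))   # A . B = ~(~A + ~B)
        assert not_(a) == 1 - a
        assert and_(a, b) == a & b
        assert or_(a, b) == a | b
```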

A B | A+B | ~A ~B | ~A & ~B | ~(~A & ~B)
0 0 |  0  |  1  1 |    1    |     0
0 1 |  1  |  1  0 |    0    |     1
1 0 |  1  |  0  1 |    0    |     1
1 1 |  1  |  0  0 |    0    |     1

Look at what this does with a sum-of-products circuit. The OR is replaced with a NAND and a bubble (complement) is placed on each of its inputs. Slide each bubble back to its AND gate. Sum of products is a NAND of NANDs! Go back to your FA subcircuit and run Project->Analyze Circuit. This time when you run Build Circuit, click use only NAND gates and give it a new name.

24.2.5 (Display page) Letting tools work for you

Start a new project, drop in three input pins and one output. Click on each pin to bring up the properties in the lower left portion. Update the labels to reflect the names used in the columns of the truth table for the majority function. Enter the truth table. Take a look at the expression. Did it find the same optimizations that you did? Build the circuit. Build it again using 2-input gates. 24.2.6 (Display page) Cranking up the CAD tools Cranking up the CAD tools

Your ripple-carry adder is pretty easy to understand and verify. It simply mimics adding pairs of bits and the carry-in from right to left. To see where the power of modularity comes in, and to see how optimization tools can do things that humans are really not very good at, let's make Logisim do a bit of heavy lifting. Open up adder.circ once again. This time run Project->Analyze Circuit on your 4-bit adder in main, rather than on the bit slice in FA. It generates a huge truth table. Click Build Circuit and put the result in a new subcircuit. Can you figure out what is going on in the result?

24.2.7 (Display page) Building a decoder

One of the important concepts in machine structures is number systems. While we were learning about integers in C we had to make the shift from decimal representation to binary representation. Both are positionally valued, with the digits (bits) multiplied by increasing powers of the base. Your fingers may seem a lot like bits, but for some reason we use a completely different number system when we count on our hands. There we use a kind of unary representation: the number is the count of the fingers that we are holding up. Just as we converted between numbers in different bases, we can convert between binary and unary. In fact, your full adder converted three unary bits into two binary bits: the sum is the 1's place and the carry is the 2's place. When we string them together, we add three digits of the same position value to produce one of the same position value and one of the next larger position. A decoder does the reverse. It converts a binary input into a unary output. However, instead of the number of fingers, it is the position of the finger that gives the value. (Yes, counting on your fingers is a very inefficient number system!) The truth table for a 2-bit decoder is given below. Create a new project and implement this decoder using logic gates.

S1 S0 | D3 D2 D1 D0
 0  0 |  0  0  0  1
 0  1 |  0  0  1  0
 1  0 |  0  1  0  0
 1  1 |  1  0  0  0
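One gate-level reading of this table, sketched in Python as a cross-check (not a Logisim circuit): each output is a distinct minterm of the two select inputs.

```python
# A 2-bit decoder: the binary number on (s1, s0) raises exactly
# one of the four unary outputs. Each output is one minterm.
def decode2(s1, s0):
    d0 = (1 - s1) & (1 - s0)
    d1 = (1 - s1) & s0
    d2 = s1 & (1 - s0)
    d3 = s1 & s0
    return d3, d2, d1, d0   # same left-to-right order as the table

assert decode2(0, 0) == (0, 0, 0, 1)
assert decode2(1, 1) == (1, 0, 0, 0)
```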

It should have four data outputs, d0, d1, d2, d3 on the right side, and two control inputs, s1 and s0. When you are done, you can compare your solution to the one in ~cs61cl/code/decoder.circ. One new bit of Logisim that you will want is a constant. Click on Gates in the explorer panel on the left to find one.

24.2.8 (Display page) Multiplexing and demultiplexing

In this week's homework you are constructing a 2-1 multiplexor (MUX). The other name for a multiplexor is a selector: the control inputs select one of the data inputs to appear on the output. We will use these a lot. If you haven't done that exercise already, you will probably want to do it now. A demultiplexor is the inverse of a multiplexor. It steers the input to the one of 2^n outputs that is specified by the n control inputs. Modify your decoder so that it becomes a demultiplexor. The truth table for this demultiplexor appears below. In this case, however, you will find it easier to think structurally and just steer the input to the correct output, rather than cranking through the truth table.
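To make the structural view concrete, here is an illustrative Python sketch (not part of the lab) of a 2-1 mux and of the steering a demultiplexor performs; the demux body matches the truth table below.

```python
# A 2-1 multiplexor selects one data input; a demultiplexor steers
# its one data input to the selected output. They are structural inverses.
def mux2(s, a, b):
    # output is a when s == 0, b when s == 1
    return ((1 - s) & a) | (s & b)

def demux4(s1, s0, x):
    # steer x to the output position named by (s1, s0); the rest stay 0
    d0 = (1 - s1) & (1 - s0) & x
    d1 = (1 - s1) & s0 & x
    d2 = s1 & (1 - s0) & x
    d3 = s1 & s0 & x
    return d3, d2, d1, d0

assert mux2(0, 1, 0) == 1 and mux2(1, 1, 0) == 0
assert demux4(1, 0, 1) == (0, 1, 0, 0)
```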

S1 S0 IN | D3 D2 D1 D0
 0  0  0 |  0  0  0  0
 0  0  1 |  0  0  0  1
 0  1  0 |  0  0  0  0
 0  1  1 |  0  0  1  0
 1  0  0 |  0  0  0  0
 1  0  1 |  0  1  0  0
 1  1  0 |  0  0  0  0
 1  1  1 |  1  0  0  0

24.2.9 (Display page) Bundles of bits

In learning C and MIPS machine language, we have come to see that variables are not really numbers and strings. They are collections of bits organized as words that we manipulate in various ways according to what those collections of bits represent. At the lowest level of the machine we have signals, individual wires that carry a voltage representing the logical value of a bit at any point in time, and busses, collections of signals that carry a closely related collection of bits, e.g., a word. All CAD tools provide some means of going between the signal view of individual wires and the bus view of wider logical units. Logisim handles this duality with two mechanisms. First, the splitter tool can either take several individual signals and group them together into a bundle, or split a bundle into its individual signals. Second, each input to a gate can have a bit width, as well as a number of inputs. So far, all our inputs were one bit wide. The logical symbols can be applied to every signal in a bundle. This is a little like the bitwise operators in C (&, |, and ^), which do a bitwise logical operation on each bit in the (scalar) object, i.e., char or int.

Let's first use the concept of a data bus on your four-bit adder. Open it up under Logisim and remove the individual input pins and output pins. (Grab the select tool, right-click on the object and select delete, or select a region and cut.) Under the Base menu, grab the splitter tool. Change its properties so that it faces south and has a bit width of 4 and a fanout of four. Drop this above your 4-bit adder built out of FA's. Attach a wire to the top, i.e., the 4-bit wide part. Grab the input pin tool, set it to be south-facing and four bits wide. Attach it to your wire and label it A. Attach the individual signals to the A input of your FA's. Do the same for the B side. Grab the splitter again, but make it north-facing with a bit width of 5 and a fanout of 5.
(This is a bit of a misnomer, because it will really be a fan-in). Why does it need to be five bits wide? What is the fifth bit? Connect the sum output and the carry out to your unsplitter and put a 5-wide output pin on it facing north. It should look something like what is below. Poke the inputs. Now this adds binary numbers, rather than a bunch of bits.
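The word-level view has an exact software analogy; this illustrative Python sketch (not part of the lab) shows the bundled 4-bit add with its 5-bit result, and a splitter-like unpacking of a bus into individual bits.

```python
# Word-level view of the circuit: a 4-bit add whose 5-bit result
# carries the carry-out in the top bit, like the 5-wide splitter.
def add4(a, b):
    assert 0 <= a < 16 and 0 <= b < 16
    return (a + b) & 0x1F        # 5-bit result: bit 4 is the carry out

def split(bus, width):
    # like Logisim's splitter: bundle -> individual bits, index 0 = LSB
    return [(bus >> i) & 1 for i in range(width)]

assert add4(0b1111, 0b0001) == 0b10000   # 15 + 1: carry out, sum bits 0
assert add4(0b0101, 0b0011) == 0b01000   # 5 + 3 = 8
assert split(0b1010, 4) == [0, 1, 0, 1]
```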


While you've got the Base menu open, notice the Probe tool. This is your debugging friend. Grab it and drop it on each of the internal carries. Poke the inputs to see that it is showing you what's going on. Here you get the same information as the highlighting on the signal, but you can make the probe multiple bits wide and drop it on busses. Try it.

24.2.10 (Display page) A logic unit

Let's get some experience with gates that are several bits wide. Copy over ~cs61cl/code/and4.circ and open it up. This contains a 4-bit-wide bitwise AND circuit using splitters and 2-input 1-bit-wide gates. Rip out the splitters, the gates, and the wires connecting them. Replace all of that with a single 2-input 4-bit-wide AND gate and wire it to the input and output pins. Poke it to convince yourself that it is the same circuit. Now we can move up one level of design. The Logisim gates are really a library of gates of different sizes and widths. You know how to build multiplexors and complex Boolean functions out of individual logic gates. Logisim provides a library of the larger devices too. Go to Project->Load Library->Built-in Library and select Plexers. Open that up; you will see some familiar parts. Let's build a simple 16-bit logic unit. It has four functions: AND, OR, XOR, EQ. It has two 16-bit inputs, A and B, and one 16-bit output, R. It has a 2-bit control input OP that selects which function is performed, according to the following table.

OP | Function
00 | AND
01 | OR
10 | XOR
11 | EQ

Build this using four 16-bit wide logic gates and one 16-bit wide multiplexor.

24.2.11 (Display page) A shifter

All sorts of useful circuits can be built with multiplexors and splitters. Build a 4-bit wide arithmetic right shifter. It should have a 4-bit data input D, a 4-bit data output R, and a 2-bit control input S that is the shift amount. Recall that an arithmetic right shift fills with the leftmost bit (the sign bit) rather than zero as it shifts to the right; the rightmost bits just disappear. You can build all of this with a single 4-way 4-bit wide MUX and some splitters. However, as soon as you have a bundle of bits you need to keep track of the bit order (endianness). Hover your select tool over the individual bits of the split and see what they say. Pay careful attention to the markings on the MUX. Can you see which is the 0 input?

24.3 Project 2 checkoff (1 step)
24.3.1 (Display page) Checkoff requirements

During lab on March 12 or 13, show progress on project 2. Up to 4 homework points will be awarded for this milestone: 1 point for each of the following cases (translated to assembly language) for which your program produces the correct output, up to a maximum of 4.

int k;
char outstring[20];

k = snprintf (outstring, 20, "%d", 5);


k = snprintf (outstring, 20, "%u", 7);
k = snprintf (outstring, 20, "%x", 9);
k = snprintf (outstring, 20, "%c", 'x');
k = snprintf (outstring, 20, "%s", "abc");
k = snprintf (outstring, 20, "%%");

24.4 Homework and readings (7 steps)
24.4.1 (Display page) Homework

Homework assignment 8, due Monday, March 30 at 11:59pm, includes the five exercises that follow, plus peer evaluation of homework 7. 24.4.2 (Display page) Homework, part 1 Homework, part 1

Implement a 4-2 multiplexor (four inputs, two selector inputs, and two outputs) using nothing more than 2-1 multiplexors. Submit your solution, a file named mux4_2.circ, to Expertiza under the name mux4_2. 24.4.3 (Display page) Homework, part 2 Homework, part 2

Using the same method as for the adder in lab, implement a 4-bit subtractor. First, construct a truth table for a bit slice. It has three inputs: A, B, and BRin (borrow in). It produces two outputs: R and BRout. It works just like the subtraction you learned by hand: A minus the borrow minus B produces one bit of the result and a borrow (which may be zero) toward the next higher position. Do the following example by hand first.

  A:  1 1 0 1  (13)
  B: -0 0 1 1  ( 3)
      -------
  R:  1 0 1 0  (10)
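After you have derived your own truth table, you can check it against a behavioral model. This illustrative Python sketch uses the standard borrow expressions (an assumption; your derived table should agree with it):

```python
# One bit slice of a borrow-ripple subtractor: computes A - B - BRin
# as a result bit R and a borrow out toward the next higher position.
def sub_slice(a, b, brin):
    r = a ^ b ^ brin
    brout = ((1 - a) & (b | brin)) | (b & brin)
    return r, brout

# Invariant: a - b - brin == r - 2*brout for every input combination.
for a in (0, 1):
    for b in (0, 1):
        for brin in (0, 1):
            r, brout = sub_slice(a, b, brin)
            assert a - b - brin == r - 2 * brout

def sub4(a, b):
    # ripple the borrow from bit 0 (rightmost) to bit 3
    br, result = 0, 0
    for i in range(4):
        r, br = sub_slice((a >> i) & 1, (b >> i) & 1, br)
        result |= r << i
    return result    # 4-bit wraparound result; br is the final borrow

assert sub4(0b1101, 0b0011) == 0b1010   # 13 - 3 = 10, as in the example
```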

Submit your solution, a file named subtractor.circ, to Expertiza under the name subtractor. 24.4.4 (Display page) Homework, part 3 Homework, part 3

You are to implement a 4-bit 4-input circuit that sorts its inputs in a single clock cycle (hence the name "parallel" or "combinational" sort). Shown below is a schematic of the ParallelSort module. Your job is to translate this schematic into Logisim. We have already done some of the work for you; you may create new wires as needed, but you may not add inputs or outputs.

If you have a few extra minutes:

Try to draw the schematic for Exchange4. Attempt to simplify the circuit by reducing the number of Exchange4 instances. Is it possible? Why or why not? Can you come up with a design for a sequential sorter which uses fewer Exchange4 instances by sharing them over time?
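The schematic itself is not reproduced here, so the following is only a hypothetical behavioral reading: assume an Exchange4 element outputs its two 4-bit inputs in (smaller, larger) order. Five such exchanges wired as a sorting network then sort four values combinationally, which matches the spirit of the questions above. Illustrative Python sketch:

```python
from itertools import permutations

# Hypothetical Exchange4 behavior (an assumption, not taken from the
# schematic): output the two 4-bit values in (min, max) order.
def exchange(a, b):
    return (a, b) if a <= b else (b, a)

# A classic 5-exchange sorting network for four inputs.
def parallel_sort4(w, x, y, z):
    w, x = exchange(w, x)
    y, z = exchange(y, z)
    w, y = exchange(w, y)
    x, z = exchange(x, z)
    x, y = exchange(x, y)
    return w, x, y, z

# Every input ordering comes out sorted.
assert all(parallel_sort4(*p) == (1, 2, 3, 4)
           for p in permutations((1, 2, 3, 4)))
```

Five is in fact the minimum number of compare-exchange elements for four inputs, which bears on the "can you reduce the number of Exchange4 instances" question.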

Submit your solution, a file named sorter.circ, to Expertiza under the name sorter.

24.4.5 (Display page) Homework, part 4 Homework, part 4

Modify your 4-bit counter from lab on March 17-18 so that it is a configurable up/down counter. Do this by introducing an additional flip-flop that holds the "state", i.e., a bit that determines whether it counts up or down, and an input to set this state bit. Do not change the structure that you have built with FAs and D FFs. Instead, use the state bit to select one of two increment values with a mux. Submit your solution, a file named counter.circ, to Expertiza under the name counter.

24.4.6 (Display page) Homework, part 5

In lab on March 17-18 you saw a shift register (online in the file ~cs61cl/code/shiftreg.circ). A one-dimensional pong game of sorts can be designed in the same way. It should have a set of five D FFs like in the shift register, but in this case when the 1 hits the right end it reverses and heads back to the left. When it hits the left end it reverses again. Like the up/down counter, you will want to introduce a bit of state that is the direction. Use muxes to steer the bits in the correct direction. Submit a file named pong.circ to Expertiza as pong.

24.4.7 (Display page) Readings

Material relevant for next week's lab activities may be found in P&H sections C.5-C.8 (fourth edition) or B.5-B.8 (third edition). 25 Synchronous circuits and flip-flops 2009-3-17 ~ 2009-3-18 (3 activities) 25.2 Elements with state(11 steps) 25.2.1 (Display page) Introduction to storage elements Introduction to storage elements

In previous labs we have seen that we can construct any Boolean function as an acyclic graph of Boolean gates. We call these combinational circuits; the outputs are purely a function of the inputs. In our simulator, changes on the inputs propagate instantaneously to the outputs. In reality, such things take time. For example, the speed of a specific technology is often described in terms of the propagation delay of a single inverter driving a typical load. If we wait long enough for the outputs to settle, they are purely a function of the inputs. For a circuit with 64 gates on the longest path from an input to an output (called the critical path), settling will take a lot longer than for one with only 8 gates on the critical path.

In the synchronous design methodology, digital systems are constructed using two kinds of circuit elements: combinational gates and storage elements. All feedback from outputs to inputs must go through a storage element. There is a special signal, called the clock, that allows the storage elements to capture and hold onto their inputs. Whenever the clock ticks, all the storage elements grab their inputs and record them internally. Soon afterwards, the new values become visible on the outputs of the storage elements, whereupon they feed into the inputs of combinational logic and propagate along until they reach the input of a storage element. There, the value sits unheeded until the clock ticks again. A schema for this style of design is shown in the figure below. Acyclic blocks of combinational logic take their inputs from synchronous storage elements, called registers, and their outputs feed into registers. All cycles in the circuit include at least one register. When the clock ticks, the entire state of the system is updated.
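This schema has a direct software analogy: on each tick, the whole state is replaced by a combinational function of the old state. An illustrative Python sketch, using a simple 4-bit counter as the state-update function:

```python
# The synchronous schema in software: every clock tick, all registers
# capture new values computed combinationally from the old state.
def tick(state):
    # the "combinational logic": old state -> new state (4-bit counter)
    return (state + 1) % 16

state = 0
for _ in range(20):
    state = tick(state)   # one clock edge: the whole state updates at once
assert state == 4         # the counter wrapped past 15 back around
```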

Often the performance of a digital system is indicated by its clock frequency, called the clock rate. For example, current PCs run at a little more than 3 GHz, that is, 3 billion clock cycles per second. (One Hertz = one cycle per second.) The clock cycle time must be long enough to accommodate the longest propagation delay from any register output to any register input. A typical rule of thumb is that the circuit should be designed so that this is roughly 10 levels of logic. Thus at 3.3 GHz, the clock cycle is roughly 300 picoseconds; that allows for 30 picoseconds of propagation per gate. Actually, light takes almost 100 picoseconds to cross the width of a large chip, so the allowed gate delay is more like 15 picoseconds. Pretty fast.

25.2.2 (Display page) Propagation delay: a simple glitch

The figure below shows a very simple circuit that implements Y = X & ~X. As a Boolean function, you would certainly say that Y = false. Build the circuit in Logisim. Be sure to label the input pin and the output pin. Let's now look at a useful tool in Logisim that we will use throughout this lab. Pull down Simulate->Logging to bring up the Log window. On the Selection tab, add the input and output pin labels (in that order). Click the Table tab. With the poke tool on your circuit, poke the input pin. You will see the table of inputs and outputs build up as you poke it.
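To see why a real, delayed version of this circuit misbehaves, here is an illustrative unit-delay simulation in Python (one time step per gate is an assumption for illustration; it is not how Logisim models delay):

```python
# Unit-delay simulation of Y = X & ~X. Each gate's output lags its
# inputs by one time step, so Y briefly goes to 1 right after X
# rises, even though Y is logically always 0.
X = [0, 0, 1, 1, 1, 1]            # input waveform over time
t = [1]                           # inverter output, initially ~0 = 1
Y = [0]                           # AND output
for n in range(1, len(X)):
    t.append(1 - X[n - 1])        # one gate delay through the inverter
    Y.append(X[n - 1] & t[n - 1]) # one gate delay through the AND
assert Y == [0, 0, 0, 1, 0, 0]    # a one-step glitch after X rises
```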

The table that you are looking at is a waveform—a sequence of values over time. Here time goes down vertically, and only the changes are shown. When you change the input, Logisim simulates the circuit until the output is determined. Here, that is always 0. If you had a real AND gate and INV gate and hooked them up to an oscilloscope to watch the voltage on various signals over time, the waveform for X would look like the top graph in the figure above. It is zero for a while, at some time you raised it to 1, and later you lowered it to 0. However, if you were to look at Y, you would see that just a tiny, tiny time after you raised X there is a little glitch where Y goes to 1. This is because gates take time. If you were to look inside at the intermediate signal t, it would look like an inverted version of X, except delayed by one gate delay. This means that for one gate delay (perhaps 15 picoseconds) both X and t are equal to 1! A gate delay later this will appear at the output Y, but it will disappear one gate delay after that. Question: what will happen when X is dropped to zero again?

25.2.3 (Display page) Cycles in combinational logic

In the synchronous design methodology, we never allow cycles within combinational logic. If cycles were to occur, they would make the outputs even more history-sensitive than the glitch we saw above. In Logisim, build the circuit shown below that breaks all the rules. When in == 1, this circuit represents a logical contradiction:

out == ~~~out == ~out. The output cannot settle; if it settled to 1 it would need to be 0 and if it settled to 0 it would need to be 1. Watch what happens when you poke the input to 1.


Logisim detects the cycle, marks it as inconsistent, and refuses to simulate it. If you wired it up in the lab, it would do something; it wouldn't blow up. Based on what you saw in the glitch, what do you think this circuit would do?

25.2.4 (Display page) Clocks and flip-flops

If all we had were combinational elements, our digital systems could not remember anything. We need to introduce history dependence in a controlled manner. In other words, we need state. The basic circuit element that holds state is a flip-flop. The name comes from its behavior: it can flip from one state to another and hold it. In order to use flip-flops, we introduce a special waveform, called a clock, that oscillates (or ticks) with a well-defined period. In fact, we engineer the clock so that it looks like a periodic square wave, and its edges are carefully engineered to be the same throughout the design. The figure below shows a sketch of the clock waveform and the flip-flop we will be using. This is a positive-edge-triggered D flip-flop. In addition to the clock input, it has a data input D. It holds a single bit of state, which is visible on its output Q. On the rising edge of the clock it captures D and stores it internally. Any other time, you can wiggle the input all you like; it won't change the output. In our synchronous design methodology, we make sure that all the inputs are stable before the edge arrives (remember, the clock cycle is longer than the delay along the critical path). Often a flip-flop will have asynchronous inputs as well, to clear or set the internal value.
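The edge-triggered behavior just described can be mimicked in software. This illustrative Python sketch (not Logisim) captures D only on a rising clock edge and ignores it at all other times:

```python
# A positive-edge-triggered D flip-flop in software: Q changes only
# on a 0 -> 1 transition of the clock; D is ignored otherwise.
class DFlipFlop:
    def __init__(self):
        self.q = 0
        self._last_clk = 0

    def step(self, clk, d):
        if clk == 1 and self._last_clk == 0:   # rising edge: capture D
            self.q = d
        self._last_clk = clk
        return self.q

ff = DFlipFlop()
ff.step(0, 1)       # D wiggles while the clock is low: no effect
assert ff.q == 0
ff.step(1, 1)       # rising edge: Q captures D
assert ff.q == 1
ff.step(1, 0)       # clock still high: input changes are ignored
assert ff.q == 1
```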

In this class, we won't actually see how to construct flip-flops from transistors or gates. There is an art to it, and you can study it in EECS 150 if you like. Instead, we are going to use built-in elements. In Logisim, start a new project. Go to Project->Load Library->Built-in Library and select Memory. Opening this up you will see a D Flip-Flop. Add one to your empty schematic. Mouse over its pins so you can see its connections. You can read about it under Help. Connect an input pin to D and an output pin to Q. Under Base you will find a special pin called Clock. Drop it in and connect it to the clock input. It should look something like the following. Go ahead and open a logging window. Poke at D and see that nothing changes. You can poke the clock, but Logisim has special support for it. Pull down the Simulate menu. You will see an option to tick once, and also options to set the frequency and to start the clock running.

Set the Tick Frequency to 0.25 Hz (4 seconds per cycle). Enable the clock. You will see in the log and on the screen that the output only changes when the clock ticks. You can poke at the input in between; all that matters is the value when the clock ticks.

25.2.5 (Display page) A simple circuit with state

We can create a simple circuit with state by removing the input pin in the clock circuit that you created above and instead feeding the output back to the input through an inverter. Make this adjustment to your circuit. Tick the clock with the poke tool. Enable logging and observe the clock and the output. When does the output change? How is the frequency of the output waveform related to the frequency of the clock signal? This circuit is about the simplest example of our synchronous design methodology. Pretty much the whole circuit consists of a cycle containing one D flip-flop and one inverter. (Incidentally, the thing you just created is called a T flip-flop. It toggles every clock tick.) Given that your circuit has no input, you might wonder how to initialize it. If you go to Simulate->Reset Simulation you will see that the whole circuit goes undefined. When you tick the clock it sorts itself out. A better way is to introduce an explicit reset. The built-in Logisim D flip-flop has two asynchronous inputs, one to set it and one to clear it. Both ignore the clock. Attach a pin to the clear input. With this set to 1 you can let the clock tick away and the output value will be held to 0. When you release the clear it will start to toggle on the clock. The clear can be asserted at any time and it will reset the flip-flop, even if the clock is not ticking.

25.2.6 (Display page) Counters

Grab ~cs61cl/code/adder.circ as a starting point for a four-bit adder built out of FA subcircuits. Rip out the inputs and outputs so that you have the basic ripple-carry adder. Add four D flip-flops to hold the value of the counter. Connect the FAs and the D inputs so that when the clock ticks, the state in the four D FFs advances by one, i.e., i' = (i+1) mod 16. You will want to attach a clear to all the D FFs so that you can reset the counter. Try to figure it out. If you need a hint, click here. Notice that every time the clock ticks, all state elements capture their new input simultaneously. Thus the new state is a function of the old state. When we put a collection of flip-flops together like this with a common set of control and clock inputs so that they function as a single unit, we call it a register. Here you have built a 4-bit register. With the adder it is a 4-bit counter. Modify your circuit so that the constant input comes from a set of input pins, rather than the value 0001. Now you have a "by-N counter". At each clock tick, the new state is purely a function of the old state and the input pins. In this case, the output of the circuit is exactly the state. Grab your old majority circuit that you built in an earlier lab and attach the four outputs of the state register to it. Now you have a nice logic tester. If you put the state bits and the majority output into the logging tool and let the clock tick to cycle through 0 to 15, the result should look exactly like your old truth table. What you have built here is a finite-state machine. In fact it is called a "Moore machine". The new state is a function of the state and the inputs. The output is a function of the state. You can describe its behavior with a bubble-and-arc diagram. This machine implements a particular kind of state machine that just counts in a circle. The orbit depends on what you count by.

25.2.7 (Brainstorm) 1 banana, 2 banana ...
You have now built a 4-bit counter with a variable count increment.
With 4 bits it has a maximum of 16 states. For what increment values will the "orbit" of states include all 16 values? What does the counter do when you set the increment to 1111? Why? 25.2.8 (Display page) Shift register Shift register

Copy the file ~cs61cl/code/shiftreg.circ. This implements a 4-bit shift register. Start the clock ticking with init set to 1. Poke it to 0 and let it fly. Notice that with 4 bits this counter cycles through 4 states, while your counter cycles through 2^4 = 16 states. Modify this circuit to create what is called a "Johnson counter" that cycles through 2*4 = 8 states as follows.

0000 1000 1100 1110 1111 0111 0011 0001

25.2.9 (Display page) Finite state machine

(This activity is part of homework assignment 8.) Modify your 4-bit counter so that it is a configurable up/down counter. Do this by introducing an additional flip-flop that holds the "state", i.e., a bit that determines whether it counts up or down, and an input to set this state bit. Do not change the structure that you have built with FAs and D FFs. Instead, use the state bit to select one of two increment values with a mux.

25.2.10 (Display page) Pong

(This exercise is part of homework assignment 8.) Now let's build a one-dimensional pong game of sorts. This should have a set of five D FFs as in the shift register, but in this case when the 1 hits the right end it reverses and heads back to the left. When it hits the left end it reverses again. Like the up/down counter, you will want to introduce a bit of state that is the direction. Use muxes to steer the bits in the correct direction. If you have the time and inclination, you can attach LEDs and buttons to your circuit from the built-in input-output library.

25.2.11 (Display page) Up leveling the design

Structures like adders and registers are so important that most CAD tools give you built-in versions that are highly optimized. Rebuild your 4-bit counter using a 4-bit register out of the built-in Memory library and a 4-bit adder out of the built-in arithmetic library. Use multibit wires wherever possible. 25.3 Homework and readings(2 steps) 25.3.1 (Display page) Homework Homework

Homework assignment 8 and your peer evaluation of homework assignment 7 are both due by 11:59pm, Monday, March 30. (Note that exercise 8.3 has been slightly revised.) Solutions and grading standards for homework 7 are now available. 25.3.2 (Display page) Readings Readings

P&H sections C.9-C.11 (fourth edition) or B.9-B.11 (third edition) are useful background reading for this week's activities. 26 Registers 2009-3-19 ~ 2009-3-20 (3 activities) 26.1 Work with registers.(6 steps) 26.1.1 (Display page) Introduction Introduction

Collections of flip-flops with common control signals are registers. They are the fundamental containers of state in digital systems. Generally, the number of bits in a register is the word width of the design. The semantics of a design's operations are specified as a sequence of register transfers: on the clock edge, information transfers from certain registers to certain other registers based on the control signals at that time.

The design of digital systems is a classic example of "divide and conquer". We divide the problem into two simpler problems: datapath and control. The datapath consists of registers, function units, and busses that connect outputs to inputs. The datapath has to be able to support the register transfers and operations required to implement the design. But it does not have any of the intelligence; it does not know which register transfers to do when. It has a bunch of control inputs that tell it what to do. It is a puppet, and the control points are the strings. It also has some outputs, called signals, that indicate when certain important events occur. The other piece of the solution is the controller. Its whole job is to figure out which register transfers to do when, and to assert the control points that cause those register transfers to be carried out. The controller is essentially a finite state machine. It has a control state. Its inputs are the signals from the datapath and its outputs are the control points. We will gain more experience with these concepts of datapath and control over this and the next couple of labs as we move toward implementing a complete processor. The controllers for instruction interpretation are highly specialized.

26.1.2 (Display page) Register subcircuit

Start a new project called RTL and, using Project->Load Library->Built-in Library, load Memory, Plexers, and Arithmetic. This will give us tools to work at the RTL level. Add a subcircuit called reg16. In this we'll create a 16-bit register with synchronous parallel load and asynchronous clear. Focus on reg16. Expand Memory and click on the register element. Its only attribute is bit width. Make it 16. Add a register to your reg16 circuit. Mouse over its pins. Like the D flip-flop, it has a data input D, a data output Q, an asynchronous clr, and a clock. Every cycle, this will capture whatever is on D, store it internally, and drive it on the outputs for the entire clock period. Often we want to have lots of registers around to hold values, and we decide when to load a new value into each register. We'll create a subcircuit that reflects that concept. This is also the time to establish the "coding conventions" for our modules: for example, data inputs on the left, data outputs on the right, and control on the top. This will let us build datapaths that run left to right, like in the book. Logisim treats the clock specially. It has a special element type, and you can tick it manually or automatically. We will wire the clock signal locally in the subcircuit; this will save us a lot of routing of clock wires and ensures that the clock is handled consistently. You never, never gate the clock. It just ticks. Selective loading of registers and the like is determined by the control signals. To provide the parallel load capability in your register subcircuit, place a MUX on the input and bring the register's output back to the 0 input of the MUX. The 16-bit external input pin connects to the other input. The selector is your LOAD control signal. You'll need an input pin for clr and a 16-bit output. Now we have a register subcircuit with selective load. Place it in your main schematic and verify its behavior. Note that you can poke at individual bits in a multi-bit input pin. Notice also that the internal value of a register shows in its base symbol, but not when it is an element in a subcircuit. The way to gain visibility in the enclosing circuit when the output is not brought out to a pin is to attach a probe to the register subcircuit output. You will use this a lot in project 3. 26.1.3 (Display page) Decrementer Decrementer

Build a decrement subcircuit using your reg16. You can use any of the elements in the built-in library, including MUX and subtractor. It should have a 16-bit input In, a 16-bit output Out, two control input pins, ld and dec, and one output pin, zero. Zero is asserted when reg == 0. The operational behavior is as follows.

ld dec
 0  0   hold value in register
 0  1   if (reg != 0) reg = reg - 1
 1  x   reg = In

You should be able to assert ld, tick the clock a full cycle (two ctrl-T), deassert ld and assert dec, and then watch it decrement down to zero as you tick the clock. 26.1.4 (Display page) A timer: a FSM design illustration A timer: a FSM design illustration

Your decrement subcircuit provides essentially the whole datapath for a timer. We'll walk you through an example design and have you modify it. Ours is in ~cs61cl/code/timer.circ. To run this, set Simulate->Tick Frequency to 1 Hz so you can see what's going on. Set Simulate->Ticks Enabled to start the clock running. It is now sitting idle. Poke at the time input. Click go. To build this, we first dropped in a decrementer, connected a 16-bit input pin to In, a 16-bit output pin to Out, and an LED to zero. To make sure it all works, we put a pin on the Ld and Dec lines. Assert Ld. Step the clock twice (the rising edge and the falling edge). Unassert Ld and assert Dec. Step the clock till it reaches zero and the LED goes on. OK, this manual control is what we want to automate by building a control state machine. Before we do that, let's capture a real-world input. In Logisim's Input/Output library there is a button element. Drop one in your design and put a probe on the output. You'll see that it is 1 only while you hold it down. We'd like to use such a button to start our timer. The problem is that you might push it for an instant that does not quite include any clock edge. The way our state elements work is that they sample their inputs right on the rising edge of the clock. We'll clean up the button by using a synchronizer. Drop a D flip-flop in the design and connect the button to the asynchronous set pin (the left pin on the lower edge). This will cause the FF to be set to 1 when the button is pushed. Connect the constant 0 to the D input and connect the clock to the clock input. The output of the FF is a synchronized version of the button. Whenever you press the button it will go to 1 until the next clock edge, just long enough for other parts of the circuit to use the signal. Now we are ready to build a FSM controller to make the decrementer behave like a timer. 
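The synchronizer's behavior can be sketched in a few lines of C (a hypothetical model, not Logisim code): pressing the button is the asynchronous set, and each rising clock edge samples the constant 0 wired to D.

```c
#include <assert.h>

/* Model of the button synchronizer: a D flip-flop with D wired to 0
   and the button wired to the asynchronous set pin. */
typedef struct { int q; } SyncFF;

/* Asynchronous set: takes effect immediately, no clock needed. */
void press_button(SyncFF *ff) { ff->q = 1; }

/* Rising clock edge: the flip-flop samples D, which is constant 0. */
void clock_edge(SyncFF *ff)   { ff->q = 0; }
```

So the output is 1 from the moment of the press until the next clock edge, exactly the pulse the controller needs.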
Informally, when the button is pressed it should load the duration value and then it should count down till it reaches zero and turns on the LED. Designing a state machine is a stylized process. First we construct a symbolic state transition diagram (STD). This is a high-level bubble-and-arc graph like one might see in a discrete math course. Each bubble represents a state. Inside the bubble we write down the register transfers that are to occur in that state. On the clock tick the specified register transfers are performed and the transition to the next state occurs based on the inputs. We indicate the input by labeling the arcs with the input expression that causes that transition. The leftmost diagram below shows the symbolic STD for our timer.


Here we sit in the 'idle' state until we get the start signal. We transition to the 'set' state to load the register. We are in this state for exactly one tick, since the transition to the 'run' state is unconditional. In that state, we decrement reg until the zero signal occurs. The next stage in the FSM design process is to assign a representation to the states and convert the register transfers to the control signals that will effect those transfers. Here we have made those two things the same. We have three states, so we need two state bits. We'll call them L and D. So 'idle' is 00, 'run' is 01, and 'set' is 10. In 'set' we assert L and deassert D, so the output that causes reg <= In is identical to the state bits. (Isn't that handy?) The final stage is to convert the concrete state diagram to a truth table and generate logic that computes the state bits. Look carefully at the truth table. The input columns are the current state and the inputs. (X is a don't care. It is a convenient way to summarize multiple rows.) The output columns are the new state bits. When the clock ticks, the new state bits will be loaded into the state FFs, thus advancing the state. Finally, we make sure that if the machine were to happen to be powered up in an illegal state (here 11) it would do something reasonable (here, go to the idle state). In timer.circ you have the classic form of a "Moore" FSM.

A set of state bits (here 2). A block of combinational logic that computes outputs from the state (here no gates are needed because the outputs are equal to the state bits). A set of inputs (here start and zero). A block of combinational logic that computes the new state from the state and the inputs.
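The same Moore machine can be sketched in software (a hypothetical C model, using the state encoding from the text: idle = 00, run = 01, set = 10). Each call to timer_tick is one rising clock edge: it performs that state's register transfer and advances the state.

```c
#include <assert.h>
#include <stdint.h>

/* State encoding from the text: idle=00, run=01, set=10 (bits L,D). */
enum { IDLE = 0, RUN = 1, SET = 2 };

typedef struct { int state; uint16_t reg; } Timer;

/* One rising clock edge: register transfer plus next-state logic. */
void timer_tick(Timer *t, uint16_t in, int start, int *led) {
    switch (t->state) {
    case IDLE: if (start) t->state = SET;     break;
    case SET:  t->reg = in; t->state = RUN;   break; /* L asserted: reg <= In */
    case RUN:  if (t->reg != 0) t->reg -= 1;         /* D asserted: count down */
               else t->state = IDLE;          break;
    default:   t->state = IDLE;               break; /* illegal state 11 recovers */
    }
    *led = (t->reg == 0);                            /* zero signal drives the LED */
}
```

Note the default case: it is the software analogue of making the illegal state 11 fall back to idle.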

26.1.5 (Display page) Counting up Counting up

Modify the timer design so that instead of counting down from N to 0 it counts up from 0 to N. (Be sure to save it as a new project.) 26.1.6 (Display page) Add it up Add it up

Further modify your up-counter to include an accumulator. This means another register and another adder. When you start the machine it should clear the accumulator register. In each tick it should do

acc = acc + reg; as well as

reg = reg + 1; To check the correctness of your solution you can verify that the result is equal to the sum of an arithmetic series. (Remember (A(1) + A(n))*n/2.) 26.2 Memory(4 steps) 26.2.1 (Display page) Random Access Memory (RAM) Random Access Memory (RAM)

Registers form the dedicated storage elements within the processor. They have specific inputs and outputs and control signals. Memory is a large, regular block of storage elements that are individually addressed, and are explicitly read or written. We call it Random Access Memory (RAM) because any word can be accessed as easily as any other. We have met several important abstractions in this class, including the instruction set abstraction at the hardware/software boundary and the digital abstraction at the logic gate. Here we meet the memory abstraction. A memory has two operations, read and write. We write a data value at an address. When we read an address we get the most recent value written to that address. A memory element has an address bus, a data bus, and some control signals. The width of the data is the unit of information that is read or written. The number of words that can be addressed with n bits is 2^n. Actual memory chips have a specific height and width and number of


data lines and address lines, etc. And they come with a variety of different choices for using the control signals to perform operations. In Logisim we can pick the size of memory we want. It uses a "synchronous write, asynchronous read" protocol. We'll explain this as we go. Create a new project and add the memory built-in library to it. Open up the memory library and click on RAM. The attributes give you the opportunity to size the memory. Make the data bit width 16 bits and the address bit width 8. (How many bytes are in this memory?) Drop it into your schematic. Note that there is only one set of data lines for reading and writing. A single wire can only be used for one thing at a time, so we will need to take turns. Connect input pins of the proper width to A and out. You can connect a 1 to sel. We will focus first on reading, so put an output pin on the D line. Memories are used to hold lots of information. In Logisim we can initialize the contents of the memory. Right-click the memory and select Edit Contents. Put some recognizable values in the various addresses. (You will notice that you can also save to a file or load from a file. We'll use this later.) Assert out. Now the memory acts like a superduper combinational gate. The input is the address and the output is the contents of the memory at that address. Poke the address bits to select different addresses and check that the corresponding contents appear on D. We are reading it continuously. Poke out to 0. Note that the D lines all read x. This is not "don't care"; this x is "undefined". The RAM is no longer driving any value on those lines. In order for something else to write the RAM, it needs to treat D as an input, not an output. Add an input pin to the D line (you may need to extend the wire as a T). Add a clock to the clk input with a wire so that you can see when the clock is high or low. Poke your new data input to be something different from the contents of the currently selected address. 
(Note that now the input pins are driving the D line and the output pins on that line follow along.) Click the clock. On the rising edge the D value is written to the selected address. Click the clock again to complete the cycle. Play with this some more, changing the address and the value and writing various words. Now we are going to cause some trouble. Assert out. Look what happens to D. The orange indicates that two different sources are trying to drive the same wire. What value is on the wire? If both happen to drive low it will be a zero; if both drive high, a one. What if one drives low and the other high? Logisim indicates this bit as E for ERROR as shown below. In real hardware bad things happen and things get hot. Don't let this happen.

To solve this problem we need the input pins to behave as nicely as the memory. We use a special kind of gate called a tri-state buffer. In Logisim under Gates it is called a "controlled buffer". When the control input is high it passes the data through. When it is low, the output disconnects. This is the tri in tri-state. The output goes to high impedance, like an infinite resistor. It should look something like the following.


Add a new input labeled read and connect it to out. Invert it and connect it to the tri-state control. Now we have something that behaves like a memory. Set it to read and give it an address: it gives you data. Set it to write (!read) and tick the clock to write the data. 26.2.2 (Display page) Word sum Word sum

Drop a 16-bit wide RAM into your accumulator project. In this case we will only be reading the RAM, so you don't have to worry about the write logic. Connect the lower bits of your counter to the address lines. The data input to your accumulator should be the D lines. Now when you run it with input N it should sum the first N words of the memory, leaving the result in the accumulator. 26.2.3 (Brainstorm) Sequencer How is the reg in your word sum similar or different from the PC in your MIPS machine? How is the accumulator similar or different from $1 in your MIPS machine? 26.2.4 (Display page) Register file Register File

To wrap up, let's put several of the ideas of the last lab and this one together and get an idea of what is inside a memory. In your MIPS machine you have individual registers like the PC, but the core of the instruction set architecture is a collection of 32 (well, 31) general-purpose registers. This is a lot like a little memory. The "address" is the register specifier in the instruction. The data interface is a lot more elaborate, since you can read two registers and write one all at the same time. We'll build a scaled-down register file. Create a project that has your reg16 subcircuit in it. Drop eight of these down in the schematic. Add a 16-bit input Din and connect it to all the D inputs. This gives us a way to potentially write any register. To read one, we need to select the one to read. Bring all eight Q outputs to a 16-bit wide 8-to-1 MUX with 3 selector inputs. Connect the selector to an input called A and drop an output pin on the output of the MUX called Dout. Now we have a small RAM. We can read any of the registers just by changing the address bits. How do we write just one? Drop in a decoder with 3 select bits and 8 outputs. Connect each output to the Ld control of the corresponding register. You could connect the two sets of address lines together or you could keep them separate (like Rd and Rs). It's like a memory but smaller and simpler. There is none of the read/write handshake. You can read any register and write any register simultaneously. Try it out and make sure it works. Any idea how you would make R0 read as zero? What would you do if you didn't want to write any register during some cycle? 26.3 Homework and reading(2 steps) 26.3.1 (Display page) Reading Reading

Next week's labs will cover the design of the MIPS single-cycle CPU. Please read P&H sections 4.1-4.4 (fourth edition) or sections 5.1-5.4 (third edition) as preparation. 26.3.2 (Display page) Already assigned homework Already assigned homework

Just a reminder ...

Project 2 is due Monday night, March 23. Submit it on your instructional account, using the command

submit proj2

from the directory that contains your snprintf.s file. Homework assignment 8 and the peer evaluations for homework assignment 7 are due Monday night, March 30.

27 MIPS CPU: datapath 2009-3-31 ~ 2009-4-01 (5 activities) 27.2 Project 3 milestone(1 step) 27.2.1 (Display page) Checkoff requirements Checkoff requirements

Present to your lab t.a. or reader three diagrams and tables:

a datapath diagram similar to P&H Figure 4.1 (fourth edition) or 5.1 (third edition); a diagram similar to P&H Figure 4.2 (fourth edition) or 5.2 (third edition) that shows control signals as well as the datapath; a truth table relating each op code (inputs) to corresponding control signal values (outputs).

These will end up as part of your readme file. Also provide tested implementations of two of the following components:

the register file; the ALU; the rotate-right units; or the basic decoder.

27.3 The ALU (Arithmetic and Logic Unit)(8 steps) 27.3.1 (Display page) Patterson and Hennessy references P&H references

Activities for this week's labs are based on coverage in P&H sections 4.1-4.4 (fourth edition) or 5.1-5.4 (third edition). In particular, their Figure 4.17 (fourth edition) or 5.17 (third edition), which displays the datapath and control for a single-cycle MIPS CPU, will be a useful resource. Paper copies are available in the lab, and the diagram itself is online at the URL http://inst.eecs.berkeley.edu/~cs61cl/fa07/diagrams/5.17.png. In subsequent steps, we will use the fourth edition references. Just change "4" to "5" if you're using the third edition. 27.3.2 (Display page) ALU operations ALU operations

This week we will bring together

what we have learned about designing digital systems using combinational logic and state elements with what we learned about machine instructions

and study a simple implementation of the MIPS instruction set architecture. You will want to have your copy of P&H handy. The first three sections of chapter 4 introduce the concepts involved in implementing instructions from the top down. You should at least scan through those before starting the lab materials. If you look at Figure 4.1, you will notice that we saw most of the parts in recent labs. The figure has instruction and data memories that are essentially the RAM we used. The PC and associated adders are similar to the incrementer. There is a register file and an ALU. We've got a bit more work to do to make a real machine out of these. To see where we are going, Figure 4.17 in P&H provides a road map for our coverage. We'll start with the Arithmetic and Logic Unit (the ALU). This unit implements arithmetic and logic operations, as the name suggests:

addition, subtraction, multiplication, and division; and, or, not, nor, and xor; comparison; shifting.
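In software terms, an ALU is one big selected operation. Here is a sketch in C covering a few of the operations above; the op names are made up for illustration and are not the MIPS ALU control encodings, which come later.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical op tags, for illustration only. */
enum AluOp { ALU_ADD, ALU_SUB, ALU_AND, ALU_OR, ALU_XOR, ALU_SLT };

/* The op input selects which function unit's result comes out,
   just as a MUX does in the hardware. */
uint32_t alu(enum AluOp op, uint32_t a, uint32_t b) {
    switch (op) {
    case ALU_ADD: return a + b;
    case ALU_SUB: return a - b;
    case ALU_AND: return a & b;
    case ALU_OR:  return a | b;
    case ALU_XOR: return a ^ b;
    case ALU_SLT: return (int32_t)a < (int32_t)b;  /* signed comparison */
    }
    return 0;
}
```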

In addition to the main ALU, there are three "mini-ALUs": one to add 4 to the program counter, another to compute a branch address, and the third to shift the contents of a branch immediate field. 27.3.3 (Display page) The adder The adder

The adder is given two 32-bit integers as inputs, and produces the sum as output. You have already seen how one might go about designing an adder. Here's a quick review. Examining the truth table seems fruitless: there are 2^64 rows! Instead, we look for a way to break the problem down into smaller circuits that can be combined into the circuit we want. The standard addition-by-hand algorithm we learned in grade school provides a hint for how to do this. Consider, for example, the addition of two 4-bit values a and b, producing a sum s.

    a3 a2 a1 a0
  + b3 b2 b1 b0
  -------------
    s3 s2 s1 s0

We start by adding the least significant bits, computing a0 + b0 = s0. A second output will represent a potential carry into column 1; we'll call that c1. Here's the truth table.

a0 b0 | s0 c1
 0  0 |  0  0
 0  1 |  1  0
 1  0 |  1  0
 1  1 |  0  1

Adding in the second column from the right requires slightly different behavior, namely the use of an extra input: the carry from the first column. Here's its truth table.

a1 b1 c1 | s1 c2
 0  0  0 |  0  0
 0  0  1 |  1  0
 0  1  0 |  1  0
 0  1  1 |  0  1
 1  0  0 |  1  0
 1  0  1 |  0  1
 1  1  0 |  0  1
 1  1  1 |  1  1

27.3.4 (Display page) Constructing a 32-bit adder out of 1-bit adders Constructing a 32-bit adder out of 1-bit adders

One may observe that the 1-bit adder just designed for the second addition column works for the remaining columns as well. That is, the same structure can be used to produce s_k and c_(k+1) from a_k, b_k, and c_k, for k ranging from 1 to 31. The resulting circuit is diagrammed below.
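The chained structure can be sketched in C as a loop over bit positions (a model of the hardware, not a sensible way to add in C). Each iteration is one full adder: the sum bit is a XOR b XOR c, and the carry out is the majority of the three inputs.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit ripple-carry adder built from 1-bit full adders: the carry
   out of bit k feeds the carry in of bit k+1, and bit 0 gets cin = 0. */
uint32_t ripple_add(uint32_t a, uint32_t b) {
    uint32_t sum = 0;
    int c = 0;                                   /* carry into bit 0 */
    for (int k = 0; k < 32; k++) {
        int ak = (a >> k) & 1, bk = (b >> k) & 1;
        int s  = ak ^ bk ^ c;                    /* sum bit           */
        c      = (ak & bk) | (ak & c) | (bk & c);/* carry out         */
        sum   |= (uint32_t)s << k;
    }
    return sum;
}
```

Because the bits are treated uniformly, the same loop adds unsigned and two's-complement values correctly, which is the property noted below.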

In fact, the addition in the rightmost column need not be treated as a special case. The adders used for columns 1-31 can also be used in column 0, merely by providing an input of 0 as the carry-in. This circuit is called a ripple adder, since carries "ripple" from the rightmost to the leftmost columns. The possibility of a rippling carry constrains the speed of the adder. P&H section C.6 (fourth edition) or B.6 (third edition) describes an adder design that trades simplicity for speed. One of the very important properties of this adder is that it works for unsigned numbers and signed numbers (2's complement). In fact, it doesn't even need to know which kind it is working on; it does just the same thing in either case. Addition is moving clockwise on the number wheel. Plug a positive and a negative 4-bit number into the 4-bit adder that you built last week. Convince yourself that it does the right thing. 27.3.5 (Display page) Overflow Overflow

It is very useful to have the adder signal overflow when it occurs. This is one place where signed and unsigned arithmetic differ. In adding unsigned values, detecting overflow is pretty obvious: if there is a carry out from the MSB, there was an overflow. This is where you cross from FF...F to 00...0 on the number wheel. The more interesting case is detecting overflow when adding signed integers. If you add a positive and a negative number there cannot possibly be an overflow, because the magnitude of A–B (for positive A and B) must be less than the larger of A and B. When you add two positive signed values and an overflow occurs, there is no carry out. What happens instead? Whenever you add two negative numbers there is a carry out, since both have the sign bit on, so that is not what tells you there is an overflow. What does? If you remember the number wheel, an overflow occurs when the result of the arithmetic operation crosses from the most positive numbers to the most negative, or vice versa. In other words, overflow occurs when adding two positive numbers produces a negative sum, or adding two negative numbers produces a non-negative sum. It is possible to detect the overflow using only the inputs and outputs of the most significant bit slice of your adder. Write down the truth table for the FA that forms the MSB of your adder. Add a sixth column that is overflow. Derive a boolean expression for the overflow. Can you reduce this to a single gate? Add this simple overflow detector to your 4-bit adder. 27.3.6 (Display page) Subtraction Subtraction

In designing our adder and in detecting overflow we reduced the mathematical concepts down to simple logic gates on the bits that are flowing along the wires. We can do a similar thing with subtraction. In your homework this past week you designed a ripple subtractor. Here's another way to tackle the problem. Recall that in 2's complement, the value represented by a series of bits

[B(n-1), B(n-2), ..., B(0)]

is

-B(n-1)*2^(n-1) + B(n-2)*2^(n-2) + ... + B(0)*2^0.

Notice the minus sign associated with the numerical weight of the sign bit! The arithmetic concept of negation, i.e. mapping A to -A, can be implemented on the bit representation by complementing all


the bits (called the 1's complement) and adding one to the result. (You can prove this rather easily.) The operation A – B in two's complement arithmetic is equivalent to A + !B + 1. Thus we can use the adder for subtraction as well, by employing a multiplexor as shown below (and in your book).
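The identity A – B = A + !B + 1 can be checked in C, since unsigned arithmetic wraps modulo 2^32 just as the 32-bit adder does. This is a sketch of the idea, using C's bitwise NOT for the 1's complement:

```c
#include <assert.h>
#include <stdint.h>

/* A - B computed the way the adder hardware does it: invert B and
   add 1. The "+ 1" is what the rightmost carry-in will supply. */
uint32_t sub_via_add(uint32_t a, uint32_t b) {
    return a + ~b + 1u;
}
```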

Inverting all the bits in B is easy; we just route each bit through a NOT gate. Adding 1 to !B would seem to require an extra trip through the adder, but recall the design of the ripple adder. We had a carry-in input to the 1-bit adder for the least significant (rightmost) bit, and we set it to 0. Instead, we could have done A + B + 1 with the same hardware. That's just what we need now. The diagram below shows how this is done. 27.3.7 (Display page) An adder/subtractor An adder/subtractor

Modify your 4-bit adder so that it becomes an adder/subtractor. In addition to the two 4-bit data inputs A and B, it should have a one-bit OP input. 0 means add, 1 means subtract. Congratulations, we are well along towards having an ALU. All we need to add is some logic. 27.3.8 (Brainstorm) A more complete ALU In an earlier lab, you implemented a logic unit that implemented four operations (AND, OR, XOR, and EQ) using gates and a MUX. Explain how you would combine this unit and your adder/subtractor into an "arithmetic logic unit", and describe what the OP input would be for the resulting six-operation unit. 27.4 The rest of the MIPS CPU(17 steps) 27.4.1 (Display page) P&H MIPS CPU diagram overview P&H MIPS CPU diagram overview

We now move on to study the rest of the MIPS CPU, diagrammed in P&H Figure 4.17. The diagram contains a lot of wires and four multiplexors. We'll examine these shortly. We've already studied parts of the ALU. The diagram includes several components that implement ALU parts: two other adders, a shifter, and a sign extender. There are two more combinational logic circuits labeled "Control" and "ALU control"; we'll examine these shortly as well. Finally, there are four state elements: the program counter, the registers, and two places to store input from memory or output to memory. One instruction per clock tick will be executed, and all the state elements will get updated at each clock tick. The CPU in the diagram supports eight instructions. Three are I-format: beq, lw, and sw. Five are R-format: add, sub, and, or, and slt. (The j instruction is added last; the diagram for j appears in Figure 4.24.) Here is a review of the instruction formats for these instructions:

All the instructions start with an op code in bits 31-26. The R-format instructions read two registers, specified in bits 25-21 and 20-16, and write into a third register, specified in bits 15-11. There are two kinds of I-format instructions, beq and the two instructions referencing memory. Both kinds involve a signed immediate field in bits 15-0. The beq and sw instructions read values from registers specified in bits 25-21 and 20-16. The lw instruction reads from the register specified in bits 25-21, but writes to the register specified in bits 20-16. The j instruction doesn't reference any registers. It computes an address from bits 25-0 of the instruction.
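In software, these fields can be pulled out with shifts and masks. A sketch using the standard MIPS field positions (op in bits 31-26, rs in 25-21, rt in 20-16, rd in 15-11, funct in 5-0, immediate in 15-0):

```c
#include <assert.h>
#include <stdint.h>

/* Field extraction: shift the field down to bit 0, then mask off
   the bits above it. */
uint32_t op_field(uint32_t insn)    { return (insn >> 26) & 0x3F;  }
uint32_t rs_field(uint32_t insn)    { return (insn >> 21) & 0x1F;  }
uint32_t rt_field(uint32_t insn)    { return (insn >> 16) & 0x1F;  }
uint32_t rd_field(uint32_t insn)    { return (insn >> 11) & 0x1F;  }
uint32_t funct_field(uint32_t insn) { return insn & 0x3F;          }
uint32_t immed_field(uint32_t insn) { return insn & 0xFFFF;        }
```

For example, the encoding 0x00221820 (add $3, $1, $2) decodes to op 0, rs 1, rt 2, rd 3, funct 0x20.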

Execution of each instruction begins identically. The program counter signal goes to a state element that, given an address, retrieves the contents of memory at that address. The various bits of the instruction are routed to the "Control" module, which determines the functionality for each op code, and to the register state element, which selects register values relevant to the instruction. 27.4.2 (Display page) Decoding instructions Decoding instructions


We will deal later with fetching the instructions from memory and updating the PC. For today, we'll work with the 32-bit bus called Instruction in Figure 4.17 as an input. The first thing we do with an instruction is decode it so that we can determine how to execute it. In software, one might do this using shifts and masks. In hardware we do it with wiring and decode tables. You see this in Figure 4.17 in the indices on the bits, such as Instruction[31-26] for the OP field. Start a new project called MIPS. Add a 32-bit wide input called Instruction. Use a splitter to crack it open into 32 individual bits. Attach a 6-bit wide reverse splitter to bits 31-26 to create an output called OP. Do the same for the fields RS, RT, RD, Funct, and Immed. Notice that the last three overlap, i.e., they use the same bits of the instruction. Some of these fields we will be able to use directly; others we will need to decode further. 27.4.3 (Display page) ALU control ALU control

In Figure 4.17, a 2-bit ALUOp signal derived from the 6-bit OP field is combined with the 6-bit Funct field (which is referred to as the opcode field in the caption of the figure) to produce the ALU control input that specifies what ALU operation is to be performed. A table of signal values (Figure 4.12 in P&H) appears below.

instruction | ALUOp | operation        | Funct field | desired ALU action | ALU control input
lw          | 00    | load word        | don't care  | add                | 0010
sw          | 00    | store word       | don't care  | add                | 0010
beq         | 01    | branch equal     | don't care  | subtract           | 0110
add         | 10    | add              | 100000      | add                | 0010
sub         | 10    | subtract         | 100010      | subtract           | 0110
and         | 10    | and              | 100100      | and                | 0000
or          | 10    | or               | 100101      | or                 | 0001
slt         | 10    | set on less than | 101010      | set on less than   | 0111

Observe that the ALUOp signal specifies whether or not to use the Funct field to direct the ALU; 10 means yes, 00 and 01 mean no. A truth table that derives ALU control values from the Funct field values appears in Figure 4.13 of P&H. You have probably noticed that there are two ways to derive the encoding of the control signals: starting from the encodings of the instruction fields and working towards the actual function units, or starting from the implementation of the function units. In practice, a combination is used. You probably noticed in your ALU design above that it really didn't matter which operator outputs were connected to which inputs of the multiplexor. You might as well wire it up so that you can essentially use the bits from the instruction to make the selection. 27.4.4 (Brainstorm) Conditions If you look at the ALU in Figure 4.17, you will notice that there is a signal Zero out of the ALU, in addition to the result. This is an output of the datapath used to determine if a branch is taken. How would you modify your ALU above to include a signal that is 1 when all the bits of the ALU output are 0? 27.4.5 (Display page) Registers Registers

The processor's 32 general-purpose registers are stored in a structure called a register file. The register file is a collection of registers in which any register can be read or written by specifying the number of the register in the file. As you saw in earlier activities, a decoder is used to turn each 5-bit register field of an instruction into the corresponding register selector, and then the data in the corresponding register is output for use, say, by the ALU. Here's an example:

10100 => 20 => 32-bit contents of register 20

The register file for the MIPS uses three ports, two for reading and one for writing. The RS field is always used to select the register to be read on port 1. The data value is output on Read Data 1. Thus, bits [25-21] can be wired directly to Read Register 1. Similarly, the RT field is always used for Read Register 2 and the corresponding value is available at Read Data 2. Finally, the write register is sometimes specified by RD (as in an R-type instruction) and sometimes by RT (as in an I-type instruction). 27.4.6 (Brainstorm) Register file modification How would you modify your register file design from the previous lab to have two read ports and one write port as in Figure 4.17? 27.4.7 (Brainstorm) Synchronization P&H note (in Figure 4.7) the need to coordinate the reading of a register's value with the replacement of the value in that register. During what portion of the cycle are the read values used and when is the write value utilized? What happens if the read register and the write register happen to be the same? 27.4.8 (Brainstorm) Distinguishing R-type and I-type instructions Determine as efficiently as possible whether the instruction is R-type or I-type. 27.4.9 (Display page) Sign extension Sign extension

You have run into sign extension several times in this course. In C, when a short integer is stored into a larger integer, it needs to be sign-extended, or negative values would accidentally become positive. In the hardware, it shows up in how the 16-bit immediate is extended to a 32-bit value before reaching the ALU. Build a sign extender in Logisim. We'll scale it down a bit from the MIPS: it should have an 8-bit data input In, a control input X (0 means zero extend, 1 means sign extend), and a 16-bit data output Out. It is mostly splitters, but you'll need a little logic to select the bit. 27.4.10 (Display page) Register transfers Register transfers

The meaning of each instruction is defined by the register transfers that take place when it is executed. These transfers are carried out by the data path. So, to determine what needs to be connected to what, we write down all the register transfers that we need to do. Then we make sure that we can do them all. Finally, we design the control logic so that the control signals cause the right transfers to take place. Look back at the subset of instructions in the ALU control table above. What is the register transfer required for the R-type instructions—add, sub, and, or, slt? They all have the form R[rd] = R[rs] op R[rt].

instruction  operation      Register Transfer
lw           load word
sw           store word
beq          branch equal
add          add            R[rd] = R[rs] + R[rt]
sub          subtract       R[rd] = R[rs] - R[rt]
and          and            R[rd] = R[rs] & R[rt]
or           or             R[rd] = R[rs] | R[rt]
slt          set less than  R[rd] = R[rs] < R[rt]
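All the R-type rows above have the same shape, R[rd] = R[rs] op R[rt]. As a rough C sketch over a toy register array (our own illustration, not course code):

```c
#include <stdint.h>

/* Toy model of the R-type register transfers: R[rd] = R[rs] op R[rt].
   The op argument selects which ALU operation is applied. */
void rtype(uint32_t r[32], int rd, int rs, int rt, char op) {
    switch (op) {
    case '+': r[rd] = r[rs] + r[rt]; break;
    case '-': r[rd] = r[rs] - r[rt]; break;
    case '&': r[rd] = r[rs] & r[rt]; break;
    case '|': r[rd] = r[rs] | r[rt]; break;
    case '<': r[rd] = (int32_t)r[rs] < (int32_t)r[rt]; break; /* slt */
    }
}
```

The single function with an op selector mirrors the single ALU with a control input: the data routing is identical for every R-type instruction.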

Thus, we can determine that Read Data 1 needs to be connected to ALU A and Read Data 2 needs to be connected to ALU B. 27.4.11 (Brainstorm) Register transfers for ADDI What are the register transfers for the Add Immediate (not in the table) and the rest of the operations in the table? 27.4.12 (Display page) ALU inputs ALU inputs

We are now ready to connect the register file and the ALU. Using your completed register transfer table from the previous step, write down where the A input to the ALU comes from and where the B input comes from in each case. You will find that there is one choice for the A side and two choices for the B side. Hence, the multiplexor in Figure 4.17. Can you determine what goes into the signal ALUSrc? Which instructions would involve selecting the 0 input, and which would involve selecting the 1 input? 27.4.13 (Display page) Register write data input Register write data input

In completing the register transfers, you introduced the remaining major element of the data path—the data memory. This is what is accessed by Load Word and Store Word. The address is computed just like an add immediate. A value is either loaded from that address into a register or written to that address from a register. The complete RTL is given below.

instruction  operation      Register Transfer
lw           load word      R[rt] = Mem[R[rs]+SX(im16)]; PC = PC+4
sw           store word     Mem[R[rs]+SX(im16)] = R[rt]; PC = PC+4
beq          branch equal   PC = R[rs] == R[rt] ? PC+4 + SX(im16) : PC+4
add          add            R[rd] = R[rs] + R[rt]; PC = PC+4
sub          subtract       R[rd] = R[rs] - R[rt]; PC = PC+4
and          and            R[rd] = R[rs] & R[rt]; PC = PC+4
or           or             R[rd] = R[rs] | R[rt]; PC = PC+4
slt          set less than  R[rd] = R[rs] < R[rt]; PC = PC+4
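As a sanity check on the lw and sw rows, here is how those register transfers read as C over a toy machine state. SX is sign extension of the 16-bit immediate; the struct and function names are our own, for illustration:

```c
#include <stdint.h>

/* Toy machine state: PC, 32 registers, and a small word-addressed
   memory. Names are ours, not from any real simulator. */
typedef struct {
    uint32_t pc;
    uint32_t r[32];
    uint32_t mem[256];   /* mem[i] holds the word at byte address 4*i */
} Machine;

/* SX(im16): sign-extend the 16-bit immediate to 32 bits. */
static int32_t sx16(uint16_t imm) { return (int16_t)imm; }

/* lw: R[rt] = Mem[R[rs]+SX(im16)]; PC = PC+4 */
void do_lw(Machine *m, int rt, int rs, uint16_t imm) {
    m->r[rt] = m->mem[(m->r[rs] + (uint32_t)sx16(imm)) / 4];
    m->pc += 4;
}

/* sw: Mem[R[rs]+SX(im16)] = R[rt]; PC = PC+4 */
void do_sw(Machine *m, int rt, int rs, uint16_t imm) {
    m->mem[(m->r[rs] + (uint32_t)sx16(imm)) / 4] = m->r[rt];
    m->pc += 4;
}
```

Note that both transfers compute the address the same way the datapath does: base register plus sign-extended immediate, through the same adder (the ALU).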

Now we can finish the job of wiring up the ALU and the register file and the memory.

What are the possible uses of the ALU result? The memory address and the register file. What are the possible data inputs to the register file? The ALU result and the memory read data. There are two, so we need a MUX to select between them. Where does the memory write data come from? Read Data 2, i.e., R[rt].

27.4.14 (Display page) Deriving control signals Deriving control signals

Now that we have verified that everything that needs to get somewhere has a way to get there, we need to determine how to set the various control signals to actually get them there. To do this, we add a column to our table for each control signal in the data path. (We'll let you do the PC mini-ALU.) Most of these are MUX selectors. We just fill them in according to the RTL. (We have a symbolic version of the ALUop.) You should fill in the rest of the table.

inst  operation      Register Transfer                             SX  ALUSrc  ALUop  MemRead  MemToReg  MemWrite  RegWrite
lw    load word      R[rt] = Mem[R[rs]+SX(im16)]; PC = PC+4        1   1       add    1        1         0         1
sw    store word     Mem[R[rs]+SX(im16)] = R[rt]; PC = PC+4
beq   branch equal   PC = R[rs] == R[rt] ? PC+4 + SX(im16) : PC+4
add   add            R[rd] = R[rs] + R[rt]; PC = PC+4              x   0       add    0        0         0         1
sub   subtract       R[rd] = R[rs] - R[rt]; PC = PC+4
and   and            R[rd] = R[rs] & R[rt]; PC = PC+4
or    or             R[rd] = R[rs] | R[rt]; PC = PC+4
slt   set less than  R[rd] = R[rs] < R[rt]; PC = PC+4

27.4.15 (Self Test) Control signals for lw The lw instruction has op code = 0x23, source register in bits 25-21, destination register in bits 20- 16, and offset in bits 15-0. The Branch and MemWrite signals are 0; the MemRead signal is 1. What's the value of the MemtoReg signal?

What's the value of the RegDst signal?

What's the value of the RegWrite signal?

What's the value of the ALUSrc signal?

27.4.16 (Brainstorm) beq analysis Explain the processing of the beq instruction, in the upper right corner of Figure 4.17. In particular, explain the purpose of the AND gate, and why an ALU output of "zero" might be relevant. 27.4.17 (Brainstorm) Stuck signal A common bug in circuitry is for a wire to be always "on". What would be the effect of having the ALUSrc wire always on? In particular, which instructions would no longer work correctly? 27.5 Homework(1 step) 27.5.1 (Display page) Project 3 checkoff Project 3 checkoff

In the Thursday/Friday labs, t.a.s will want to see working versions of all of the following:

the register file, the ALU, the rotate-right unit, and the basic decoder.

28 MIPS CPU: sequencer and control 2009-4-02 ~ 2009-4-03 (4 activities) 28.1 Project 3 milestones(1 step) 28.1.1 (Display page) Checkoff requirements Checkoff requirements

This checkpoint is worth up to 6 homework points. For full credit, you must have working versions of all of the following:

the register file; the ALU; the rotate-right unit; and the basic decoder.

You get 3 points for the third of these that works, and 3 more for the fourth. (Last lab's checkoff was for the first two.) 28.2 Sequencer(5 steps) 28.2.1 (Display page) Review Review

In the last lab, we investigated the datapath and control of the MIPS CPU. Today, we'll examine the instruction sequencer, which determines the sequence in which instructions get executed. We'll also analyze the critical path through the CPU, in anticipation of treatment of a pipelined CPU in upcoming lab activities.


28.2.2 (Display page) Instruction sequencer Instruction sequencer

PC holds the address of the currently executing instruction. This address is provided to the instruction memory to fetch that instruction.

instruction = instMem[PC]

This portion takes place for all instructions. The instruction is then decoded into control signals, as above, and executed by performing the associated register transfers on the data path. In addition, each instruction must perform the register transfer to update PC. For sequential instructions, this is

PC = PC + 4

Take a look at the MIPS instruction set described in the book and on the green card. What are all the updates to PC that it requires?

Branch         PC = PC + 4 or PC = PC + 4 + BranchAddr
Jump           PC = JumpAddr
Jump Register  PC = R[rs]
Jump And Link  PC = JumpAddr

Take a look at Figure 4.17. What are all the register transfers on PC that the sequencer portion of the data path can perform?

PC = PC + 4
PC = PC + 4 + signExtend(Inst[15-0]) << 2

The one that takes place is determined by the select on the MUX in the upper right corner. The latter is selected if the instruction is a branch and the Zero output of the ALU is 1. On this datapath the beq instruction is implemented by causing the ALU to perform R[rs] - R[rt] and allowing the Zero condition to determine whether the branch is taken (PC = PC + 4 + signExtend(Inst[15-0]) << 2). 28.2.3 (Brainstorm) Other branches Can you implement the BNE instruction on this datapath as is? If so, how? If not, what modification would you make so that you could implement that instruction? 28.2.4 (Display page) Implementing j (Jump) Implementing j (Jump)

We move to the j (Jump) instruction. Adding it to the instruction set requires a different approach, since it uses neither registers nor the ALU nor memory access. Its sole effect on the CPU's state elements is to change the PC. (Recall that the new PC address is the concatenation of the top four bits of the current PC+4, bits 25-0 of the instruction, and 00.) Figure 4.24 shows what's needed:

shift-left-2 logic, and a wire supplying bits 25-0 of the instruction to the shifter; logic to concatenate the top four bits of PC+4 to the shifted address; a new control signal named "Jump", and a multiplexor that uses Jump to select either the computed branch address or the jump address as the new PC.

The new wires and logic appear in solid black in Figure 4.24. (Note: Figure 4.24 as it appears online and in some copies of the textbook has a bug in the added portion of the diagram. Can you identify it?) We add additional columns to our table to contain the control points for the sequencer. Note that all the sequential instructions need to set these control points to 0 so that the MUXes in the sequencer give us PC = PC + 4. (Entries in the truth table marked "x" mean "don't care".)
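As a sketch in C (function and variable names are our own choices), the branch and jump selection just described might look like:

```c
#include <stdint.h>

/* Sketch of the next-PC logic with the Jump mux added: the branch
   mux picks PC+4 or the branch target; the jump mux then overrides
   with the jump target. Names are ours, for illustration only. */
uint32_t next_pc(uint32_t pc, uint32_t inst, int branch, int zero, int jump) {
    uint32_t seq = pc + 4;
    /* branch target: PC+4 + signExtend(Inst[15-0]) << 2 */
    uint32_t br  = seq + (((uint32_t)(int16_t)(inst & 0xFFFF)) << 2);
    /* jump target: top 4 bits of PC+4, Inst[25-0], then 00 */
    uint32_t jmp = (seq & 0xF0000000u) | ((inst & 0x03FFFFFFu) << 2);
    uint32_t pick = (branch && zero) ? br : seq;   /* branch mux */
    return jump ? jmp : pick;                      /* jump mux   */
}
```

The nesting of the two selections mirrors the hardware: the jump mux sits after the branch mux, so Jump = 1 wins regardless of the branch signals.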

inst  operation      Register Transfer                             RegDst  ALUSrc  ALUop  MemRead  MemToReg  MemWrite  RegWrite  Br  J
lw    load word      R[rt] = Mem[R[rs]+SX(im16)]; PC = PC+4        0       1       00     1        1         0         1         0   0
sw    store word     Mem[R[rs]+SX(im16)] = R[rt]; PC = PC+4        x       1       00     0        x         1         0         0   0
beq   branch equal   PC = R[rs] == R[rt] ? PC+4 + SX(im16) : PC+4  x       0       01     0        x         0         0         1   0
add   add            R[rd] = R[rs] + R[rt]; PC = PC+4              1       0       10     0        0         0         1         0   0
sub   subtract       R[rd] = R[rs] - R[rt]; PC = PC+4              1       0       10     0        0         0         1         0   0
and   and            R[rd] = R[rs] & R[rt]; PC = PC+4              1       0       10     0        0         0         1         0   0
or    or             R[rd] = R[rs] | R[rt]; PC = PC+4              1       0       10     0        0         0         1         0   0
slt   set less than  R[rd] = R[rs] < R[rt]; PC = PC+4              1       0       10     0        0         0         1         0   0
addi  add immediate  R[rt] = R[rs] + SX(im16); PC = PC+4           0       1       00     0        0         0         1         0   0
j     jump           PC = JumpAddr                                 x       x       xx     0        x         0         0         0   1

28.2.5 (Brainstorm) Other jumps Does this datapath have the capability to support Jump-and-Link? What about Jump-Register? If so, how? If not, what would you need to add to be able to do that? 28.3 Control(4 steps) 28.3.1 (Display page) Basic control logic Basic control logic

28.2.5 (Brainstorm) Other jumps Does this datapath have the capability to support Jump-and-Link? What about Jump-Register? If so, how? If not, what would you need to add to be able to do that? 28.3 Control(4 steps) 28.3.1 (Display page) Basic control logic Basic control logic

Our tables provide a complete specification for the control logic. We simply treat it as a truth table with the opcode (bits 31-26 of the instruction) as the input and the control points as the output. For our subset of the MIPS this is as shown below. Recall that for R-type instructions the opcode is 000000 and the actual function being performed by the ALU is determined by the Funct field. The flow of information through the datapath is the same for all the R-type instructions—only the operation within the ALU changes.

inst  Inst[31-26]  Register Transfer                             RegDst  ALUSrc  ALUop  MemRead  MemToReg  MemWrite  RegWrite  Br  J
lw    10 0011      R[rt] = Mem[R[rs]+SX(im16)]; PC = PC+4        0       1       00     1        1         0         1         0   0
sw    10 1011      Mem[R[rs]+SX(im16)] = R[rt]; PC = PC+4        x       1       00     0        x         1         0         0   0
beq   00 0100      PC = R[rs] == R[rt] ? PC+4 + SX(im16) : PC+4  x       0       01     0        x         0         0         1   0
add   00 0000      R[rd] = R[rs] + R[rt]; PC = PC+4              1       0       10     0        0         0         1         0   0
sub   00 0000      R[rd] = R[rs] - R[rt]; PC = PC+4              1       0       10     0        0         0         1         0   0
and   00 0000      R[rd] = R[rs] & R[rt]; PC = PC+4              1       0       10     0        0         0         1         0   0
or    00 0000      R[rd] = R[rs] | R[rt]; PC = PC+4              1       0       10     0        0         0         1         0   0
slt   00 0000      R[rd] = R[rs] < R[rt]; PC = PC+4              1       0       10     0        0         0         1         0   0
addi  00 1000      R[rt] = R[rs] + SX(im16); PC = PC+4           0       1       00     0        0         0         1         0   0
j     00 0010      PC = JumpAddr                                 x       x       xx     0        x         0         0         0   1
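Read as a truth table on the opcode bits, the control above might be sketched in C as follows (a sketch with our own names; "don't care" entries are coded here as 0):

```c
/* Main control as a function of the opcode (Inst[31-26]).
   Covers only our MIPS subset; x/don't-care entries become 0. */
typedef struct {
    unsigned regdst:1, alusrc:1, memread:1, memtoreg:1,
             memwrite:1, regwrite:1, branch:1, jump:1, aluop:2;
} Control;

Control decode(unsigned opcode) {
    Control c = {0};
    switch (opcode) {
    case 0x00: c.regdst = 1; c.aluop = 2; c.regwrite = 1; break;  /* R-type */
    case 0x23: c.alusrc = 1; c.memread = 1;
               c.memtoreg = 1; c.regwrite = 1; break;             /* lw   */
    case 0x2B: c.alusrc = 1; c.memwrite = 1; break;               /* sw   */
    case 0x04: c.branch = 1; c.aluop = 1; break;                  /* beq  */
    case 0x08: c.alusrc = 1; c.regwrite = 1; break;               /* addi */
    case 0x02: c.jump = 1; break;                                 /* j    */
    }
    return c;
}
```

Each case of the switch corresponds to one row of the table; in hardware the same mapping is a small block of AND/OR gates driven by the six opcode bits.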

28.3.2 (Brainstorm) Logic for the RegWrite signal Explain how you would determine a boolean expression for the RegWrite signal. 28.3.3 (Brainstorm) Critical path Recall that the cycle time of a digital design must be longer than the sum of the propagation delays along the longest path from the output of a state element to the input of a state element. Assuming that a memory access is 10 units, a register file access is 2, and ALU operation is 2, and all of the other MUXes, adders and control blobs are 1, which instruction has the longest critical path? How long is the critical path? Briefly explain your answer. 28.3.4 (Display page) Multicycle datapath and control Multicycle datapath and control

So far we have built a datapath that implements the entire instruction cycle (instruction fetch, decode, operand fetch, execute, memory, result store) as one huge blob of combinational logic. Every time the clock ticks, the PC gets updated, and registers may get updated or memory may get updated. What is the cycle time for such a design? Huge! We can reduce the cycle time by adding additional registers, and then each instruction will take multiple clock cycles. For example, we can separate instruction fetch from instruction execution by placing a register, the Instruction Register (IR), on the output of the instruction memory. This means all instructions are implemented by two consecutive register transfers:

Instruction Fetch:    IR = InstMem[PC]
Instruction Execute:  PC, R, Mem updated according to the table given earlier

Where in Figure 4.24 would you insert the IR register? Right on the outputs of the instruction memory. But note that this creates a bit of a problem. The instruction executes in two clocks (fetch, execute), but the sequencer is updating PC every clock. We need to place another register, nPC, on the output of the PC+4 adder. We take turns updating one and then the other. So our instruction execution looks like this:

Instruction Fetch:  IR = InstMem[PC], nPC = PC+4
R-type Execute:     PC = nPC, R[rd] = R[rs] op R[rt]
j Execute:          PC = JumpAddr
beq Execute:        PC = R[rs] == R[rt] ? nPC + BrAddr : nPC

The controller would need to keep one bit of state indicating the stage of processing: instruction fetch or execute. We'll get more experience with this in our CAL16 implementation in project 3. 28.4 Homework and readings (2 steps) 28.4.1 (Display page) Homework Homework

Peer evaluation of homework assignment 8 is due on Monday night, April 6, at 11:59pm. There will be another project checkoff in lab on April 9 and 10. 28.4.2 (Display page) Readings Readings

To prepare for next week's activities, read P&H sections 4.5 through 4.8 (fourth edition) or sections 6.1 through 6.6 (third edition). 29 A pipelined architecture 2009-4-07 ~ 2009-4-08 (3 activities) 29.2 Use pipelining to speed up throughput.(15 steps) 29.2.1 (Display page) Two ways to speed up execution Two ways to speed up execution

One way to speed up the CPU is to take advantage of the fact that some MIPS instructions can skip certain stages of execution. For instance, jumps, branches, and R-format instructions don't need to access memory, and jumps and branches additionally don't need to write into registers. This approach reduces instruction latency—the time needed to execute an individual instruction. Another way to speed up execution is to overlap instruction processing. All of the following can potentially be happening simultaneously in a sequence of five instructions:

one instruction's write to a register; the next instruction's write to memory; the third instruction's use of the ALU; decoding of the fourth instruction; and fetching of the fifth instruction from memory.

While this overlapped processing doesn't speed up any individual instruction, it can potentially improve execution throughput—the rate at which a long sequence of instructions completes—by nearly a factor of five. This pipelined organization is like an assembly line: while one team of automobile builders is fitting together a car's chassis, a second might be installing the engine in a second car, a third putting the seats inside a third car, and so on. Each stage or step in the pipeline takes a single clock cycle, so the clock cycle must be long enough to accommodate the longest stage. The diagram below illustrates the operation of a four-stage pipeline performing as well as possible; each box represents a stage, and the number in a box says which stage is being executed.
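The assembly-line arithmetic can be made concrete with a small model (a back-of-the-envelope sketch of our own, assuming k equal-length stages and no stalls):

```c
/* With k stages each taking t time units and no stalls, the first
   instruction finishes after k*t, and one more finishes every t
   thereafter: n instructions take (k + n - 1) * t rather than n*k*t. */
unsigned pipelined_time(unsigned k, unsigned n, unsigned t) {
    return (k + n - 1) * t;
}

unsigned unpipelined_time(unsigned k, unsigned n, unsigned t) {
    return n * k * t;
}
```

For large n the ratio of the two approaches k, which is where the "factor of the number of stages" speedup comes from.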

29.2.2 (Self Test) How much speedup? Suppose that each stage of a five-stage pipeline takes 1 time unit. What is the minimum number of time units necessary to complete 100 instructions?

Suppose that the second pipeline stage requires 2 time units. What's the minimum number of time units necessary to complete 100 instructions?


29.2.3 (Display page) The MIPS pipeline The MIPS pipeline

The stages of the MIPS pipeline are listed below (abbreviated titles used in P&H appear in parentheses):

1. instruction fetch (IF); 2. instruction decode and register file read (ID); 3. execution or address calculation, using the ALU (EX); 4. data memory access (MEM); 5. "write back" to a register (WB).

The figure below, Figure 4.35 (6.11 in the third edition) in P&H, illustrates the pipeline stages. The dark bars are collections of registers that save information from one stage that's needed in a subsequent stage.

The diagram below compares the performance of a single-cycle organization of the mini-MIPS CPU (studied in earlier labs) with its pipelined counterpart. We assume that memory access and ALU operation both take 200 picoseconds, and register file reads and writes take 100 picoseconds. (A sequence of lw instructions is used for comparison since loads are active in all five stages.) We also assume that the write to the register file occurs in the first half of the clock cycle and the read from the register file occurs in the second half. (We'll shortly see why.)

29.2.4 (Self Test) Stage for branch? Consider the stages illustrated in Figure 4.35 in the previous step. In which stage is the PC updated as a result of a successful branch?

29.2.5 (Display page) An example An example

P&H illustrate the execution of the five instructions

lw  $10,20($1)
sub $11,$2,$3
add $12,$3,$4
lw  $13,24($1)
add $14,$5,$6

below. Again, writes to the register file occur in the first half of the clock cycle, while reads occur in the second half. We think this representation is a particularly good one for keeping track of what instructions are doing at each stage. 29.2.6 (Brainstorm) Disrupting the pipeline A disruption of the pipeline called a hazard occurs when an instruction must wait for some other instruction to complete. A particular kind of hazard called a data hazard occurs when one step must wait for a register write to complete. Consider the instruction sequence given in the previous step (listed below for your convenience), and change a single argument of one of the instructions to cause a data hazard. Then describe your change and briefly indicate why it produces a hazard. Drawing a figure like the one in P&H Figure 4.43 (6.19 in the third edition) may be helpful.

lw  $10,20($1)
sub $11,$2,$3
add $12,$3,$4
lw  $13,24($1)
add $14,$5,$6

29.2.7 (Display page) More on hazards More on hazards

Data hazards, as we just noticed, result from too few steps between the writing of a register and its use. (The pairing of a data value definition and its use is called a data dependency.) The instruction that uses the new value must wait for it to appear in the register, thus stalling the pipeline. Another kind of hazard is a control or branch hazard produced by a jump or a branch. The problem here is that while a jump or branch is getting decoded in stage 2, an instruction—the wrong one!—is being fetched in stage 1. Without special handling, this will also stall the pipeline. A third type of hazard, not encountered in the MIPS architecture, is a structural hazard. This occurs when the hardware cannot support the combination of instructions to be executed in a given clock cycle. To avoid structural hazards, each logical component of the datapath—such as instruction memory, register read ports, ALU, data memory, and register write port—can only be used within a single pipeline stage. 29.2.8 (Self Test) Amount of stalling? Which of the following program segments can be executed without stalling (assuming no hardware is added to deal with the possibility of stalling)? Remember that register write happens in the first half of the clock cycle, while register read happens in the second half.

segment 1         segment 2          segment 3
lw  $10,20($1)    lw  $10,20($1)     lw  $10,20($1)
sub $11,$2,$3     sub $11,$2,$3      sub $11,$2,$3
add $12,$3,$10    add $12,$3,$4      add $12,$3,$4
lw  $13,24($1)    lw  $13,24($10)    lw  $13,24($1)
add $14,$5,$6     add $14,$5,$6      add $14,$5,$10

29.2.9 (Brainstorm) Hazards from branches? A CS 61C student claims that a branch instruction poses only control hazards to the pipeline, not data hazards. Do you agree or disagree? Briefly explain your answer. 29.2.10 (Display page) Dealing with control hazards Dealing with control hazards

Branches cause a problem. We previously assumed that the decision about whether or not to take a branch was made in the Execute step (via comparison of the two operands in the ALU). By that time, the two subsequent instructions would have been fetched; processing must stall to await the decision about the branch. The figure below illustrates the situation.


What's to be done? Programmers, or the assembler, could be required to include two no-op instructions after each branch. But this is surely a bad idea since branches occur so often in programs. A better solution is to add some hardware, namely a comparator in the instruction decoding stage. MIPS branches don't require complicated decision making—are two register values equal? is a value less than 0?—so the extra hardware won't be too expensive. Moreover, it allows the branch decision and PC update to be made one stage earlier, as shown in the figure below.

The second solution is to redefine the behavior of a branch. The old definition was that if the branch is taken, the instruction immediately following the branch isn't executed. The new definition is that, whether or not the branch is taken, the single instruction immediately following the branch is executed also. This behavior is called a delayed branch; the address of the instruction immediately following the branch is called the delay slot. (Jump instructions also have a delay slot, and use it in the same way as branches.) Note that in the worst case, one can always put a no-op in the delay slot. Occasionally, however, one can find an instruction preceding the branch that can be moved to the delay slot without affecting the program's overall behavior. An example appears below.

Version 1:
      or  $8,$9,$10
      add $1,$2,$3
      sub $4,$5,$6
      beq $1,$4,exit
      nop
      xor $10,$1,$11
exit:

Version 2:
      add $1,$2,$3
      sub $4,$5,$6
      beq $1,$4,exit
      or  $8,$9,$10   # always executed
      xor $10,$1,$11
exit:

One might note that the resulting code is harder to read and understand. Compilers are better able to deal with delay slots than people, and in general reordering instructions is a common method of speeding up programs. 29.2.11 (Brainstorm) Delay slot filling The example from the previous step, involving filling the branch delay slot with an or instruction, appears below. Could any of the other instructions have been moved into the delay slot? Explain why or why not.

Version 1:
      or  $8,$9,$10
      add $1,$2,$3
      sub $4,$5,$6
      beq $1,$4,exit
      nop
      xor $10,$1,$11
exit:

Version 2:
      add $1,$2,$3
      sub $4,$5,$6
      beq $1,$4,exit
      or  $8,$9,$10   # always executed
      xor $10,$1,$11
exit:

29.2.12 (Display page) Some real code Some real code

You may wish to work with a labmate on these exercises. Copy the program loops.c from the ~cs61cl/code directory. Here's the program.

#include <stdio.h>
#include <stdlib.h>

int main ( ) {
    int a, b;
    int k1, k2, sum;
    char symbol[10];
    fgets (symbol, 10, stdin);
    a = atoi (symbol);
    fgets (symbol, 10, stdin);
    b = atoi (symbol);
    sum = 0;
    for (k1=0; k1

Compile it twice, once without optimization, once with. (Recall that the command-line option -O1—minus big-oh one—produces level-one optimization.)

mips-gcc -c loops.c -o loops.opt0.o
mips-gcc -c loops.c -O1 -o loops.opt1.o

Then examine the assembly code:

mips-objdump -S loops.opt0.o > loops.opt0.disasm
mips-objdump -S loops.opt1.o > loops.opt1.disasm

In each, identify the assembly language equivalent of the nested loops. Also identify the locations of a, b, k1, k2, and sum relative to $s8 in the unoptimized code, and determine which registers are used to store these variables in the optimized code. The next few steps ask you to answer some questions about delayed branches. 29.2.13 (Brainstorm) Filled delay slots The nop instructions occurring in the unoptimized code are no-ops. Not all the delay slots contain no-ops, however. Find one that doesn't, and explain why it's OK for the instruction in the delay slot always to be executed regardless of whether or not the branch is taken. 29.2.14 (Brainstorm) Effects of optimization Now examine the optimized code, in particular the nested loops. Two of the branch delay slots have been filled in. Explain why it's OK in each case for the instruction in the delay slot always to be executed regardless of whether or not the branch is taken. 29.2.15 (Brainstorm) More optimization Fill the delay slot at location D8 with something that's not a no-op. You will also need to change the target address in the branch at location D4. Explain your optimization and why it works. 29.3 Homework and reading(1 step) 29.3.1 (Display page) Project 3 milestone Project 3 milestone

Checkpoint 3 for project 3 will happen in lab on April 9 and 10. You will earn up to 6 homework points by being able to execute instructions. 30 More on the MIPS pipeline 2009-4-09 ~ 2009-4-10 (3 activities) 30.1 Project 3 checkoff(1 step) 30.1.1 (Display page) The third project 3 checkoff The third project 3 checkoff

To earn full credit (6 points) in this milestone, you should be able to execute all the instructions that don't jump or branch. To earn any credit at all, you need to have the top-level components—register file, ALU, rotator, decoder, memory, and instruction register—debugged and wired together. Partial credit will be awarded based on which instructions or parts of instructions you can handle, and on how much of the control you have implemented. Implementing the sequencing circuitry and being able to do a branch or a jump (part of checkpoint 4) can substitute for missing aspects of other instructions. 30.2 Explore the effects of using forwarding.(15 steps) 30.2.1 (Display page) Review of ways to disrupt the pipeline Review of ways to disrupt the pipeline

In the last lab, we discussed hazards, ways to disrupt the five-stage MIPS pipeline:

conflicting use of components, for example, reading from two memory locations at the same time (structural hazards); jumps and branches (control hazards); and information needed before it's ready (data hazards).

Structural hazards are not so big a concern, since the MIPS instruction set was designed to be pipelined. All the use of the ALU comes in a single stage, for example; where other arithmetic operations are necessary, they go through separate circuitry. Thus there is no conflict for the ALU's use. We have, however, assumed that both an instruction memory and a data memory are available. If these were combined into a single memory, there would be a potential structural hazard involving instruction fetching in stage 1 and memory read/write in stage 4. Control hazards in the MIPS are dealt with in two ways. First, more hardware is added to stage 2 (instruction decoding) to make the decision to branch more quickly. Also, each jump or branch is really delayed, in that the instruction immediately following a jump or branch is always executed. Data hazards occur when a value from memory or a register isn't available yet when a subsequent instruction needs to use it. Some example code follows.

add $2,$5,$4
add $4,$2,$5
sw  $5,100($2)
add $3,$2,$4

Here is how the program segment above would be executed in a MIPS CPU without any extra hardware to handle data hazards.

Processing in the first two stages is straightforward. Suppose we start at time unit 0. Then the first add is fetched at time 0; at time 1 the first add is decoded and the second add is fetched. At time 2, the first add happens in the ALU, the second add is decoded, and the sw is fetched. However, the second add needs to access $2, but it's not ready yet. This stalls the pipeline. At time 3, the first add is idle since it doesn't access memory. The second add is still waiting for $2, so it stalls again. The sw is similarly waiting for the second add to get out of the way. Finally at time 4, the first add writes $2 in the first half of the clock cycle, and the second add can then read it. In time units 5 and 6, the second add can proceed through stage 4 and the sw through stage 3. The third add, however, must wait a time unit for $4 to be written by the second add— another stall.

A diagram of this process appears below. Stalls, indicated by items in red, are referred to as "bubbles" in P&H.
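In hardware, detecting a data hazard like the ones above amounts to comparing register-number fields of the instructions in flight. Here is a rough C sketch for our R-type/lw subset (the structure and names are ours, not from P&H):

```c
#include <stdint.h>

/* Which register does this instruction write? Only our subset:
   R-type writes rd (bits 15-11); lw writes rt (bits 20-16). */
static unsigned dest_reg(uint32_t inst) {
    unsigned op = (inst >> 26) & 0x3F;
    if (op == 0x00) return (inst >> 11) & 0x1F;  /* R-type: rd */
    if (op == 0x23) return (inst >> 16) & 0x1F;  /* lw: rt     */
    return 0;                                    /* treat as "no write" */
}

/* Hazard if the earlier instruction's destination matches a source
   register (rs or rt) of the later instruction. */
int data_hazard(uint32_t earlier, uint32_t later) {
    unsigned d = dest_reg(earlier);
    if (d == 0) return 0;            /* writes to $0 never matter */
    unsigned rs = (later >> 21) & 0x1F;
    unsigned rt = (later >> 16) & 0x1F;
    return d == rs || d == rt;
}
```

A real pipeline performs exactly these field comparisons between the instructions in adjacent stages, and asserts a stall (or forwarding) signal when they match.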

30.2.2 (Brainstorm) Detecting a hazard Stalling the pipeline till the hazard clears, like you saw in the previous activity, does not happen by magic. Control logic has to detect that the hazard is present and cause the stall to occur. This and the next question ask you to think about how this happens. A data hazard is present if the destination register of an earlier incomplete instruction is the same as one of the source registers. Imagine that you carry the instruction down the pipeline and decode the control signals at each stage. When is the destination register field actually used? How can you tell if a data hazard is present? 30.2.3 (Brainstorm) Implementing a stall How might you cause the stall of the IF and ID stages until it clears? How do you know that it has cleared? 30.2.4 (Display page) Forwarding Forwarding

One approach to dealing with data hazards is to rely on the compiler to rearrange instructions to remove the hazards. P&H note, however, that "these dependencies happen just too often and the delay is just too long to expect the compiler to rescue us from this dilemma." A solution is to add wires and control that forward information to instructions in earlier stages. The typical destination for forwarded information is the ALU; typical forwarded values are a previous ALU result or a value from memory. Here's an illustration:

This eliminates stalls resulting from consecutive R-format instructions. Some stalls are still unavoidable, however. Consider the following program segment:

lw  $10,0($14)
add $9,$10,$11

Here, the value in $10 is not available until stage 4. The diagram below shows the problem—we can't "forward" backward in time! A stall is thus required for the load-use hazard when the loaded register is used in the next instruction. Early MIPS computers required the compiler and assembler to avoid the load-use hazard! In current models, the hardware detects the hazard and inserts a stall (thus saving the space for a no-op instruction). 30.2.5 (Brainstorm) Cycle counting How many cycles are required in each iteration of the following loop, assuming that branch decisions are made in stage 2 but assuming no forwarding? Briefly explain your answer.

loop: lw    $t0,0($s1)
      addu  $t0,$t0,$s2
      sw    $t0,0($s1)
      addiu $s1,$s1,-4
      bne   $s1,$0,loop
      nop

30.2.6 (Brainstorm) Cycle counting with forwarding How many cycles are required in each iteration of the following loop, assuming that branch decisions are made in stage 2 and that values are forwarded from the ALU and the memory unit as just described? Briefly explain your answer.

loop: lw    $t0,0($s1)
      addu  $t0,$t0,$s2
      sw    $t0,0($s1)
      addiu $s1,$s1,-4
      bne   $s1,$0,loop
      nop

30.2.7 (Display page) Compiler optimizations Compiler optimizations

Two optimizations performed by a compiler on the MIPS architecture involve making use of the branch and load delay slots. Sometimes that's easy. The code

lw  $t1,0($t0)
add $t3,$t1,$t2
lw  $t2,4($t0)

becomes

lw  $t1,0($t0)
lw  $t2,4($t0)
add $t3,$t1,$t2

Control hazards require a bit more cleverness.

loop: lw   $2,100($3)
      addi $4,$4,1
      addi $3,$3,4
      beq  $2,$4,loop
      nop

becomes

loop: lw   $2,100($3)
      addi $4,$4,1
      beq  $2,$4,loop
      addi $3,$3,4

Another example appeared in the last lab:

loop: addiu $a1,$a1,1
      addiu $v1,$v1,1
      slt   $v0,$v1,$a0
      bnez  $v0,loop
      nop

becomes

loop: addiu $v1,$v1,1
      slt   $v0,$v1,$a0
      bnez  $v0,loop
      addiu $a1,$a1,1

30.2.8 (Brainstorm) An optimizing exercise (1) Consider the following program segment.

loop: lw    $t0,0($s1)
      nop
      addu  $t0,$t0,$s2
      sw    $t0,0($s1)
      addiu $s1,$s1,-4
      nop
      bne   $s1,$0,loop
      nop

Which instruction can be moved into the load delay slot? (Hint: A move that requires changes to other instructions is OK, provided that no instructions are added.) Briefly explain your answer. 30.2.9 (Brainstorm) An optimizing exercise (2) Rearrange your solution to the previous step so that only five instructions per loop iteration are executed. Briefly explain how you produced your answer. 30.2.10 (Brainstorm) A pessimizing exercise Rewrite the following code to minimize performance—that is, reorder the instructions so that this sequence takes the most clock cycles to execute while still obtaining the same result—and briefly explain your answer. Assume that the hardware provides forwarding, but a stall happens on a use following a load.

loop: lw  $2,100($6)
      lw  $3,200($7)
      add $4,$2,$7
      add $6,$3,$5
      lw  $7,300($8)
      sub $8,$4,$6
      beq $7,$8,loop

30.2.11 (Brainstorm) Appropriateness of instructions for the pipeline Consider the following instruction, a candidate for inclusion in the MIPS hardware instruction set.

addm $rdest,offset($r1) = "ADD Memory" The effect of the addm instruction is to add the contents of the specified memory location to the $rdest register.

Can the addm instruction be easily implemented in a way that's consistent with the existing MIPS pipeline? (Adding big components like another ALU or memory port isn't allowed.) Your answer should be brief, but should provide all the details of how you would implement the instruction or why a simple implementation isn't possible. 30.2.12 (Brainstorm) Another proposed instruction Consider the following instruction, another candidate for inclusion into the MIPS instruction set.

lwz $rdest,offset($r1) = "Load Word if Zero"

This instruction tests to see if $rdest is 0. If not, it does nothing; if so, it does a load into $rdest, using regs[$r1]+offset as effective address. Can the lwz instruction be easily implemented in a way that's consistent with the existing MIPS pipeline? (Adding big components like another ALU or memory port isn't allowed.) Again, your answer should be brief, but should provide all the details of how you would implement the instruction or why a simple implementation isn't possible. 30.2.13 (Display page) MIPS history MIPS history

MIPS pipeline characteristics have differed on the various models.

R2000 (1985): 5-stage pipeline
R3010 (1988): 6-stage pipeline (splitting decode and register file access)
R4000 (1992): 8-stage pipeline (2 for instruction fetch, 3 for data memory access; the extra stages involve cache handling)
R5000 (1996): two 5-stage pipelines, one for integer operations and one for floating point

Later MIPS processors incorporated advanced pipeline techniques:

Branch prediction (P&H section 4.8 [6.6]). This technique uses a small table of bits indexed by the last few bits of the word address. A simple approach is to store a 1 in this table if a branch at that address was taken, and a 0 if not. An improvement is to use two prediction bits. Branch prediction seems to work well even on architectures without delayed branching.
Out-of-order execution (P&H section 4.10 [6.9]). "Reservation stations" hold instructions that can't yet be executed. A "commit unit" determines if it's safe to put a computed result into the register file or to store it into memory.
"Superscalar" execution (P&H section 4.10 [6.9]). This technique fetches more than one instruction at a time, and uses multiple pipelines.
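The one- and two-bit prediction schemes just mentioned can be sketched in a few lines. This is a minimal model with a single table entry (a real predictor indexes a table by low-order bits of the branch address); the loop history below is an illustrative workload, not from the text.

```python
# A saturating-counter branch predictor: the counter ranges over
# 0..(2^bits - 1) and we predict "taken" when it is in the upper half.
# With two bits, a single anomalous outcome (e.g. a loop exit) does
# not flip the prediction.

def predict_run(outcomes, bits=2):
    counter = 0                       # one table entry, for illustration
    top = (1 << bits) - 1             # saturation limit
    correct = 0
    for taken in outcomes:
        prediction = counter >= (top + 1) // 2
        correct += (prediction == taken)
        # Move the counter toward the actual outcome, saturating.
        counter = min(counter + 1, top) if taken else max(counter - 1, 0)
    return correct

# A loop branch taken 9 times then not taken once, repeated 3 times:
history = ([True] * 9 + [False]) * 3
print(predict_run(history, bits=2), "of", len(history))   # 25 of 30
print(predict_run(history, bits=1), "of", len(history))   # 24 of 30
```

On this workload the two-bit scheme mispredicts only once per loop execution after warm-up, while the one-bit scheme mispredicts twice (once at the exit and again at the next entry).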

Some more MIPS history: The R8000 (1994) was the first superscalar MIPS design, with two load/store units, two ALUs, one shifter, one integer multiply/divide unit, and two floating-point units. Any combination of four of the functional units could operate concurrently. The R10000 (1995) had five separate execution units. It could do out-of-order execution. It had a 5-stage pipeline for integer operations, a 6-stage pipeline for load and store instructions, and a 7-stage floating-point pipeline. (from Wikipedia) Recent designs have all been based upon the R10000 core. The R12000 used improved manufacturing to shrink the chip and operate at higher clock rates. The revised R14000 allowed higher clock rates, with additional support for DDR SRAM in the off-chip cache [we'll talk about caches this coming week] and a faster front-side bus clocked at 200 MHz for better throughput. Later iterations, named the R16000 and R16000A, feature increased clock speed, additional L1 cache, and a smaller manufacturing process. 30.2.14 (Brainstorm) Pipeline length? Pipelines, as we have seen, increase throughput by executing stages of several instructions in parallel. The clock can be faster, since presumably there is less logic in a single stage than in multiple stages. What limits the benefits of pipelining? Why not have eight or ten or twenty pipeline stages? Give at least two factors that suggest that increasing the number of stages will reach a point of diminishing returns. 30.2.15 (Brainstorm) Simplifying the CPU Suppose that the ability to specify an offset for a load and a store were removed. Thus to get the effect of the instruction lw $5,16($9), a programmer would be required to provide the two instructions

addi $9,$9,16
lw   $5,($9)

a. A simpler instruction format ought to enable us to get rid of some circuitry. What parts of the MIPS CPU are no longer necessary? b. How can the pipeline be compressed to take advantage of this change?

30.3 Homework and readings(2 steps) 30.3.1 (Display page) Readings Readings

Next week's lab activities will be based on material covered in P&H sections 5.1-5.3 (fourth edition) or sections 7.1-7.3 (third edition). 30.3.2 (Display page) Upcoming events Upcoming events

The fourth checkoff for project 3 will be April 16-17. There will be an exam on Wednesday, April 15. If there are aspects of topics we've covered so far that you're unsure of, prepare some questions for your t.a. for the Tuesday/Wednesday lab. 31 Caches 2009-4-14 ~ 2009-4-15 (3 activities) 31.2 Cache design concepts(18 steps) 31.2.1 (Display page) The memory hierarchy The memory hierarchy

The web site for your favorite computer vendor is selling a hot new item: "This machine will run billions of instructions per second!" It has a 3 GHz CPU clock speed (that's 0.33 nanoseconds per cycle) and a 40-60 nanosecond memory access time. Having taken CS 61CL, you are not settling for marketing hype. So you do a little calculation. Every instruction is fetched from memory, and many of them (about 30%) load from or store to memory. Let's call it 1.3 memory accesses per instruction at 50 ns per memory access. That's 65 ns just on memory, or about 15 million instructions per second! (I.e., not billions.) Is this false advertising? How can so fast an execution speed be claimed with such slow memory access rates? The answer is that there's something between the CPU and main memory called a cache, which we'll describe in a minute. First, we'll describe the memory hierarchy.
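The back-of-the-envelope calculation works out as follows (the 50 ns figure is the midpoint of the quoted 40-60 ns range):

```python
# If every memory access went to raw DRAM, the memory system alone
# would cap the instruction rate far below the 3 GHz clock.

accesses_per_instr = 1.3              # 1 fetch + ~30% loads/stores
ns_per_access = 50                    # midpoint of 40-60 ns
ns_per_instr = accesses_per_instr * ns_per_access      # 65 ns
mips_rate = 1e9 / ns_per_instr / 1e6  # millions of instructions per second
print(round(ns_per_instr), "ns/instruction, about", round(mips_rate), "MIPS")
```

That's tens of millions of instructions per second, three orders of magnitude short of "billions".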

At the top is the CPU. It holds a small amount of code and data in registers that can be accessed in a fraction of a nanosecond—a single cycle. Further down is main memory. It has more capacity than the registers, but is still limited. Probably most personal computers used today have a gigabyte of main memory. Lower yet are various kinds of secondary storage, such as hard disks, USB sticks, DVDs, and tapes. A disk drive, for example, has huge capacity—a trillion bytes for around $600 these days—but is also quite slow, with access time in milliseconds.

This hierarchy is illustrated below. In general, the lower the level, the higher the latency (access time), and the lower the cost per bit. The cache fits in just below the CPU. It's on the chip with the CPU, and so is quickly accessible; access time ranges from 0.5 to 5 nanoseconds. There are levels of cache—level 1, level 2, and so on—that differ in size and in the speed of the bus connecting them to the CPU. Modern personal computers have 64-256 KB level 1 caches for instructions and data and a 4 MB level 2 cache, with perhaps a still larger level 3 cache. 31.2.2 (Display page) How caches are used


How caches are used

Here's an analogy for how the memory hierarchy is used. You're writing a term paper—the "processor"—at a table in your dorm. Doe Library is equivalent to a disk; it has essentially limitless capacity, but it takes a long time to retrieve a book. You have several good books on your bookshelf. The bookshelf is like main memory. Its capacity is smaller, which means you have to return a book when the shelf fills up. However, it's easier and faster to find a book there than to go to the library. The open books on the table are like the cache. Capacity is much less—an open book takes up more space than a closed one on the shelf, and you may have to shelve a book when too many others are open—but it's much faster to retrieve data. This organization creates the illusion of the whole library open on the tabletop: we keep as many recently used books as possible open on the table, since we're likely to use them again; we keep as many books as Doe will allow on our shelf, since it's faster to access them than to go to the library stacks. Memory, then, contains copies of data on disk that have been recently used. Similarly, the cache contains copies of data in memory that have been recently used. In general, each successive level contains the most used data from the next lower level. Each load instruction first checks the cache to see if it contains the addressed data; only if it doesn't will main memory be accessed. (We'll discuss shortly how stores are handled.) The actions we take at various levels of the memory hierarchy are similar. When we make an access and find what we're looking for (we call it a "hit"), we return the data immediately. When what we're looking for is not there (we call it a "miss"), we go to the next lower level and bring it up. (The frequency at which we have to dig down and do that is called the "miss rate".) However, the means for taking these actions are very different at different levels.
For fast memory in front of slow memory, we call the fast one a cache and everything is handled in hardware. For example, it may take 1 cycle to process a hit and tens or hundreds to process a miss. For memory in front of disk, we call the whole illusion "virtual memory"; hits are handled in hardware while misses are handled in software by the operating system. A miss there takes millions of instructions' worth of time. 31.2.3 (Display page) Why caches work Why do caches work?

Caches "cheat the laws of physics" by taking advantage of statistics. Programs don't really access RAM randomly. Programs have temporal and spatial locality and caches exploit this behavior to make them run faster. If we access a particular word, we are likely to access it again soon; this is temporal locality. Spatial locality says that if we use a piece of memory, we're likely to want to use the neighboring pieces sometime soon. 31.2.4 (Brainstorm) Locality a. Describe the general characteristics of a program that would exhibit very high amounts of temporal locality but very little spatial locality with regard to data accesses, and give an illustrative example. b. Describe the general characteristics of a program that would exhibit very little temporal locality but very high amounts of spatial locality with regard to data accesses, and give an illustrative example.

31.2.5 (Display page) Fully associative caches Fully associative caches

A fully associative cache is managed just like a hardware version of an assoc table (which you would have seen in CS 61A) or a map (which you would have seen in CS 61B). It stores, in essence, a collection of a constant number of address/contents pairs. In reality, it stores what P&H refer to as a tag, which contains only as much of the address as is necessary to find it in the table. P&H also refer to the contents as the data, and to a pair plus bookkeeping information (e.g. a "valid?" bit that says whether the cache entry contains meaningful data) as a block. When a load is encountered, the cache is searched for a block with the desired address. If a block with the desired address is in the cache, that's a hit, and the corresponding data is returned. If the cache does not contain a block with the desired address, that's a miss; the requested information is retrieved from main memory and also stored in the cache. Unlike a software lookup, which would search sequentially through all the tags in the table (too slow), the hardware checks them all in parallel. This requires a comparator for each cache entry to check an address against all the addresses currently in the cache, plus a selector to access the correct contents. If the address is found, the associated contents are returned. When the collection fills up, something currently in the collection must be replaced by the incoming address/contents pair. With a small cache, the least recently used block might be replaced; that's the block that has been unreferenced for the longest time. (The circuitry to keep track of when a block was used is complicated.) With a larger cache, the block to replace might be chosen at random, since random choice is much easier to implement in hardware. A diagram appears below.
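The bookkeeping can also be modeled in software. This is a sketch, not how the hardware works (the hardware compares all tags at once); Python's OrderedDict stands in for both the tag store and the LRU circuitry.

```python
from collections import OrderedDict

# Model of a fully associative cache of one-word blocks with LRU
# replacement. Insertion order in the OrderedDict doubles as recency
# order: the front entry is the least recently used.

class FullyAssociativeCache:
    def __init__(self, num_blocks):
        self.blocks = OrderedDict()       # tag -> valid
        self.num_blocks = num_blocks

    def access(self, byte_addr):
        """Returns 'hit' or 'miss' for a word access at byte_addr."""
        tag = byte_addr >> 2              # low 2 bits of a word address are 0
        if tag in self.blocks:
            self.blocks.move_to_end(tag)  # mark as most recently used
            return "hit"
        if len(self.blocks) == self.num_blocks:
            self.blocks.popitem(last=False)   # evict the least recently used
        self.blocks[tag] = True
        return "miss"

# Try it on the reference string 4, 8, 16, 24, 8, 12, 16, 12
# with a four-block cache:
cache = FullyAssociativeCache(4)
results = [cache.access(a) for a in (4, 8, 16, 24, 8, 12, 16, 12)]
print(results)   # miss, miss, miss, miss, hit, miss, hit, hit
```

The fifth access (8) hits, the sixth (12) evicts the least recently used block (the one holding address 4), and the last two accesses hit.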


We assume that the cache contains only words. This allows us to economize a bit (actually 2 bits) in storing the address. Just as in jump and branch instructions, where the last two bits of an address are 0 and therefore aren't stored, we can also get away with keeping just 30 bits of the address as the tag. 31.2.6 (Brainstorm) Programs and caches Consider two programs, program A with high spatial locality and low temporal locality, and program B with just the opposite. Which of the following is more likely? Briefly explain your answer.

a. Using a fully associative cache will give program A more of a performance boost than program B. b. Using a fully associative cache will give program B more of a performance boost than program A. c. The benefits of using a fully associative cache will be roughly the same for the two programs.

31.2.7 (Display page) Example cache analysis A short example

Here's an example of the use of a four-word fully associative cache with a least-recently-used replacement policy. The tag fields in the cache are omitted in the diagram to save space; the contents field in each cache entry contains M(addr), which means the word stored in memory at address addr, or "--", which means an invalid entry. The sequence of byte addresses accessed—the "reference string", in P&H terms—is 4 (00100 in binary), 8 (01000), 16 (10000), 24 (11000), 8 (01000), 12 (01100), 16 (10000), 12 (01100). The least recently used block is marked with an asterisk.

address accessed   hit or miss?   cache contents
4 (00100)          miss           M(4)   --     --      --
8 (01000)          miss           M(4)*  M(8)   --      --
16 (10000)         miss           M(4)*  M(8)   M(16)   --
24 (11000)         miss           M(4)*  M(8)   M(16)   M(24)
8 (01000)          hit            M(4)*  M(8)   M(16)   M(24)
12 (01100)         miss           M(12)  M(8)   M(16)*  M(24)
16 (10000)         hit            M(12)  M(8)   M(16)   M(24)*
12 (01100)         hit            M(12)  M(8)   M(16)   M(24)*

31.2.8 (Brainstorm) A longer example Here is a series of address references (byte addresses):

8, 12, 44, 64, 84, 52, 256, 192, 76, 44, 12, 88, 16, 108, 24, 44

Assume that a fully associative cache with 8 one-word blocks is initially empty, and uses a least-recently-used replacement policy. Label each reference in the list as a hit or a miss, and show the final contents of the cache. 31.2.9 (Display page) Caches in action with MARS Seeing caches in action with MARS

Copy the MIPS assembly language program ~cs61cl/code/cache.s to your local directory and open it in MARS. This is a very simple program that walks sequentially through an array modifying each word. Examine the program and convince yourself that you understand what it does. It has some flags that allow you to go over the array repeatedly or to skip elements when stepping through the array.

1. Set the array size to 128, the step size to 4 (1 word), and the rep count to 1. Assemble the program.
2. Open Tools->Data Cache Simulator and select Fully Associative, LRU, 32 blocks, 1-word blocks from the various popup menus. This will be a cache of 128 bytes. Click Connect to MIPS.
3. Open Tools->Memory Reference Visualizer so you can see how your program is accessing memory. Click Connect to MIPS.
4. Set a break point at the LW instruction where you fetch an element of the array. Hit Run. Single step through the LW. See the load and the miss. Step through the SW. See the hit. This is a simple example of temporal locality. Run will bring you to the next LW. Do this a few times till you see the pattern. Then you can move faster by just hitting Run until you get to the end of the program. What is the hit rate? What is the address stream presented to the memory? Is it clear why half of the accesses hit and half miss?
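The hit rate you should observe can be worked out with a small model. The actual addresses in MARS differ; only the offsets matter here. Each word of the 128-byte array (32 words) is loaded once (a compulsory miss) and then stored (a hit on the block the load just brought in), and with 32 one-word blocks nothing is ever evicted.

```python
# LRU model of the MARS data cache configuration above:
# 32 one-word blocks, fully associative.

def hit_rate(byte_addrs, num_blocks):
    resident, hits = [], 0          # front of the list = least recently used
    for addr in byte_addrs:
        tag = addr >> 2
        if tag in resident:
            hits += 1
            resident.remove(tag)
            resident.append(tag)    # mark most recently used
        else:
            if len(resident) == num_blocks:
                resident.pop(0)     # evict least recently used
            resident.append(tag)
    return hits / len(byte_addrs)

# cache.s does a lw then a sw at each word of the 128-byte array:
stream = [a for word in range(0, 128, 4) for a in (word, word)]
print(hit_rate(stream, num_blocks=32))   # 0.5
```

Every load misses and every store hits, so exactly half the accesses hit.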

31.2.10 (Brainstorm) Temporal locality If you were to change the rep count to 2 so that you went over the same array twice, what would the hit rate be? Why? 31.2.11 (Display page) More temporal locality More temporal locality

Go back to the source and change the rep count to 4. Assemble your program. Reset the cache simulator and the visualizer. Set the run speed to about 10 inst/sec so that you can see what is going on. Let it fly. Notice what happens in the cache on the first and second repeats. You can check your answer as repetition two completes and see where the hit rates go from there. 31.2.12 (Brainstorm) Replacement Reduce the number of blocks in the cache to 16 and run cache.s again with the rep count set to 4 as above. Observe the hit rate. Why is it so low? 31.2.13 (Display page) Multi-word blocks Multi-word blocks

An elaboration on fully associative caching uses multi-word blocks. When a word is added to the cache, the words around it are also added. This tactic takes advantage of spatial locality. Consider now a cache with 2-word blocks, as in the diagram below.

How is it decided, when an address/contents pair is to be added to the cache, whether the word before it or the word after it is to be added as well? Suppose addr is the byte address we want to add. The approach taken is to add the two words for which addr/8 is the same. For example, if addr is 24, we add the words at addresses 24 and 28. If addr is 36, we add the two words at addresses 32 and 36. This lets us save a bit in the tag—now we only need 29 bits to identify the address of a given cache entry. Let's see how this works with the address references

0, 12, 4, 8

and a four-word cache with a block size of 2. We put the partner word in parentheses to show that it wasn't directly accessed.

accessed address   hit or miss?   cache contents
0                  miss           0, (4)    --
12                 miss           0, (4)    (8), 12
4                  hit            0, 4      (8), 12
8                  hit            0, 4      8, 12

31.2.14 (Brainstorm) Spatial locality Here is a series of address references (byte addresses):

8, 12, 44, 64, 84, 52, 256, 192, 76, 44, 12, 88, 16, 108, 24, 44

Assume that a fully associative cache with four two-word blocks is initially empty, and uses a least-recently-used replacement policy. Label each reference in the list as a hit or a miss, and show the final contents of the cache. 31.2.15 (Display page) Bigger blocks Bigger blocks

This multi-word block scheme can be extended to four-word blocks, as in the diagram below, and beyond. Only 28 bits are needed for the tag for a four-word block. The words in each block are those for which addr/16 is the same.
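A minimal Python model makes the block grouping concrete: with 2-word blocks the block number is addr/8, and with 4-word blocks it is addr/16, so a miss brings in the partner words too. The function name here is illustrative.

```python
# LRU cache model with multi-word blocks. Dividing the byte address
# by the block size in bytes (8 for 2-word blocks, 16 for 4-word
# blocks) groups neighboring words into the same block.

def simulate(byte_addrs, num_blocks, words_per_block):
    block_bytes = 4 * words_per_block
    resident, results = [], []      # front of list = least recently used
    for addr in byte_addrs:
        block = addr // block_bytes
        if block in resident:
            resident.remove(block)
            resident.append(block)  # mark most recently used
            results.append("hit")
        else:
            if len(resident) == num_blocks:
                resident.pop(0)     # evict least recently used
            resident.append(block)
            results.append("miss")
    return results

# The 2-word-block worked example: references 0, 12, 4, 8
print(simulate([0, 12, 4, 8], num_blocks=2, words_per_block=2))
# -> ['miss', 'miss', 'hit', 'hit']
```

The two misses load blocks {0,4} and {8,12}, and the later references to 4 and 8 hit on the partner words: spatial locality pays off.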

31.2.16 (Brainstorm) Spatial locality in MARS Change the configuration of your simulated cache to have four four-word blocks, rather than sixteen one-word blocks. It still has a total size of 64 bytes. Run cache.s again and observe the hit rate. Why is it so much higher? Does it matter if you set the rep count back to 1? 31.2.17 (Brainstorm) Spatial locality example Here is a series of address references (byte addresses):

8, 12, 44, 64, 84, 52, 256, 192, 76, 44, 12, 88, 16, 108, 24, 44

Assume that a fully associative cache with two four-word blocks is initially empty, and uses a least-recently-used replacement policy. Label each reference in the list as a hit or a miss, and show the final contents of the cache. 31.2.18 (Display page) Lead-in to next lab Lead-in to next lab

In the next lab, we'll consider two other ways to organize a cache, one called "direct-mapped" caching, the other "N-way set associative" caching. Stay tuned! 31.3 Homework(1 step) 31.3.1 (Display page) Project 3 checkoff Homework

The last project 3 checkoff will be Thursday and Friday. For full credit, you should be able to run entire programs. 32 Caches-A 2009-4-14 ~ 2009-4-15 (3 activities) 32.2 Cache design concepts(18 steps) 32.2.1 (Display page) The memory hierarchy The memory hierarchy

The web site for your favorite computer vendor is selling a hot new item: "This machine will run billions of instructions per second!" It has a 3 GHz CPU clock speed (that's 0.33 nanoseconds per cycle) and a 40-60 nanosecond memory access time. Having taken CS 61CL, you are not settling for marketing hype. So you do a little calculation. Every instruction is fetched from memory, and many of them (about 30%) load from or store to memory. Let's call it 1.3 memory accesses per instruction at 50 ns per memory access. That's 65 ns just on memory, or about 15 million instructions per second! (I.e., not billions.) Is this false advertising? How can so fast an execution speed be claimed with such slow memory access rates? The answer is that there's something between the CPU and main memory called a cache, which we'll describe in a minute. First, we'll describe the memory hierarchy.

At the top is the CPU. It holds a small amount of code and data in registers that can be accessed in a fraction of a nanosecond—a single cycle. Further down is main memory. It has more capacity than the registers, but is still limited. Probably most personal computers used today have a gigabyte of main memory. Lower yet are various kinds of secondary storage, such as hard disks, USB sticks, DVDs, and tapes. A disk drive, for example, has huge capacity—a trillion bytes for around $600 these days—but is also quite slow, with access time in milliseconds.

This hierarchy is illustrated below. In general, the lower the level, the higher the latency (access time), and the lower the cost per bit. The cache fits in just below the CPU. It's on the chip with the CPU, and so is quickly accessible; access time ranges from 0.5 to 5 nanoseconds. There are levels of cache—level 1, level 2, and so on—that differ in size and in the speed of the bus connecting them to the CPU. Modern personal computers have 64-256 KB level 1 caches for instructions and data and a 4 MB level 2 cache, with perhaps a still larger level 3 cache. 32.2.2 (Display page) How caches are used How caches are used

Here's an analogy for how the memory hierarchy is used.

Image courtesy of CS Illustrated

Memory, then, contains copies of data on disk that have been recently used. Similarly, the cache contains copies of data in memory that have been recently used. In general, each successive level contains the most used data from the next lower level. Each load instruction first checks the cache to see if it contains the addressed data; only if it doesn't will main memory be accessed. (We'll discuss shortly how stores are handled.) The actions we take at various levels of the memory hierarchy are similar. When we make an access and find what we're looking for (we call it a "hit"), we return the data immediately. When what we're looking for is not there (we call it a "miss"), we go to the next lower level and bring it up. (The frequency at which we have to dig down and do that is called the "miss rate".) However, the means for taking these actions are very different at different levels. For fast memory in front of slow memory, we call the fast one a cache and everything is handled in hardware. For example, it may take 1 cycle to process a hit and tens or hundreds to process a miss. For memory in front of disk, we call the whole illusion "virtual memory"; hits are handled in hardware while misses are handled in software by the operating system. A miss there takes millions of instructions' worth of time. 32.2.3 (Display page) Why caches work Why do caches work?

Caches "cheat the laws of physics" by taking advantage of statistics. Programs don't really access RAM randomly. Programs have temporal and spatial locality and caches exploit this behavior to make them run faster. If we access a particular word, we are likely to access it again soon; this is temporal locality. Spatial locality says that if we use a piece of memory, we're likely to want to use the neighboring pieces sometime soon. 32.2.4 (Brainstorm) Locality a. Describe the general characteristics of a program that would exhibit very high amounts of temporal locality but very little spatial locality with regard to data accesses, and give an illustrative example. b. Describe the general characteristics of a program that would exhibit very little temporal locality but very high amounts of spatial locality with regard to data accesses, and give an illustrative example.

32.2.5 (Display page) Fully associative caches Fully associative caches

A fully associative cache is managed just like a hardware version of an assoc table (which you would have seen in CS 61A) or a map (which you would have seen in CS 61B). It stores, in essence, a collection of a constant number of address/contents pairs. In reality, it stores what P&H refer to as a tag, which contains only as much of the address as is necessary to find it in the table. P&H also refer to the contents as the data, and to a pair plus bookkeeping information (e.g. a "valid?" bit that says whether the cache entry contains meaningful data) as a block. When a load is encountered, the cache is searched for a block with the desired address. If a block with the desired address is in the cache, that's a hit, and the corresponding data is returned. If the cache does not contain a block with the desired address, that's a miss; the requested information is retrieved from main memory and also stored in the cache. Unlike a software lookup, which would search sequentially through all the tags in the table (too slow), the hardware checks them all in parallel. This requires a comparator for each cache entry to check an address against all the addresses currently in the cache, plus a selector to access the correct contents. If the address is found, the associated contents are returned. When the collection fills up, something currently in the collection must be replaced by the incoming address/contents pair. With a small cache, the least recently used block might be replaced; that's the block that has been unreferenced for the longest time. (The circuitry to keep track of when a block was used is complicated.) With a larger cache, the block to replace might be chosen at random, since random choice is much easier to implement in hardware. A diagram appears below.

We assume that the cache contains only words. This allows us to economize a bit (actually 2 bits) in storing the address. Just as in jump and branch instructions, where the last two bits of an address are 0 and therefore aren't stored, we can also get away with keeping just 30 bits of the address as the tag. 32.2.6 (Brainstorm) Programs and caches Consider two programs, program A with high spatial locality and low temporal locality, and program B with just the opposite. Which of the following is more likely? Briefly explain your answer.

a. Using a fully associative cache will give program A more of a performance boost than program B. b. Using a fully associative cache will give program B more of a performance boost than program A.

c. The benefits of using a fully associative cache will be roughly the same for the two programs.

32.2.7 (Display page) Example cache analysis A short example

Here's an example of the use of a four-word fully associative cache with a least-recently-used replacement policy. The tag fields in the cache are omitted in the diagram to save space; the contents field in each cache entry contains M(addr), which means the word stored in memory at address addr, or "--", which means an invalid entry. The sequence of byte addresses accessed—the "reference string", in P&H terms—is 4 (00100 in binary), 8 (01000), 16 (10000), 24 (11000), 8 (01000), 12 (01100), 16 (10000), 12 (01100). The least recently used block is marked with an asterisk.

address accessed   hit or miss?   cache contents
4 (00100)          miss           M(4)   --     --      --
8 (01000)          miss           M(4)*  M(8)   --      --
16 (10000)         miss           M(4)*  M(8)   M(16)   --
24 (11000)         miss           M(4)*  M(8)   M(16)   M(24)
8 (01000)          hit            M(4)*  M(8)   M(16)   M(24)
12 (01100)         miss           M(12)  M(8)   M(16)*  M(24)
16 (10000)         hit            M(12)  M(8)   M(16)   M(24)*
12 (01100)         hit            M(12)  M(8)   M(16)   M(24)*

32.2.8 (Brainstorm) A longer example Here is a series of address references (byte addresses):

8, 12, 44, 64, 84, 52, 256, 192, 76, 44, 12, 88, 16, 108, 24, 44

Assume that a fully associative cache with 8 one-word blocks is initially empty, and uses a least-recently-used replacement policy. Label each reference in the list as a hit or a miss, and show the final contents of the cache. 32.2.9 (Display page) Caches in action with MARS Seeing caches in action with MARS

Copy the MIPS assembly language program ~cs61cl/code/cache.s to your local directory and open it in MARS. This is a very simple program that walks sequentially through an array, modifying each word. Examine the program and convince yourself that you understand what it does. It has some flags that let you go over the array repeatedly or skip elements when stepping through the array.

1. Set the array size to 128, the step size to 4 (1 word), and the rep count to 1. Assemble the program.
2. Open Tools->Data Cache Simulator and select Fully Associative, LRU, 32 blocks, 1-word blocks from the various popup menus. This will be a cache of 128 bytes. Click Connect to MIPS.
3. Open Tools->Memory Reference Visualizer so you can see how your program is accessing memory. Click Connect to MIPS.
4. Set a breakpoint at the LW instruction where you fetch an element of the array. Hit Run. Single-step through the LW and observe the load and the miss. Step through the SW and observe the hit. This is a simple example of temporal locality. Run will bring you to the next LW. Do this a few times until you see the pattern; then you can move faster by just hitting Run until you get to the end of the program.

What is the hit rate? What is the address stream presented to the memory? Is it clear why half of the accesses hit and half miss?

32.2.10 (Brainstorm) Temporal locality If you were to change the rep count to 2 so that you went over the same array twice, what would the hit rate be? Why? 32.2.11 (Display page) More temporal locality More temporal locality

Go back to the source and change the rep count to 4. Assemble your program. Reset the cache simulator and the visualizer. Set the run speed to about 10 inst/sec so that you can see what is going on. Let it fly. Notice what happens in the cache on the first and second repeats. You can check your answer as repetition two completes and see where the hit rates go from there. 32.2.12 (Brainstorm) Replacement Reduce the number of blocks in the cache to 16 and run cache.s again with the rep count set to 4 as above. Observe the hit rate. Why is it so low? 32.2.13 (Display page) Multi-word blocks Multi-word blocks

An elaboration on fully associative caching uses multi-word blocks. When a word is added to the cache, the words around it are also added. This tactic takes advantage of spatial locality. Consider now a cache with 2-word blocks, as in the diagram below.

How is it decided, when an address/contents pair is to be added to the cache, whether the word before it or the word after it is to be added as well? Suppose addr is the byte address we want to add. The approach taken is to add the two words for which addr/8 (using integer division) is the same. For example, if addr is 24, we add the words at addresses 24 and 28. If addr is 36, we add the two words at addresses 32 and 36. This lets us save a bit in the tag: now we only need 29 bits to identify the address of a given cache entry. Let's see how this works with the address references

0, 12, 4, 8

and a four-word cache with a block size of 2. We put the partner word in parentheses to show that it wasn't directly accessed.

accessed address  hit or miss?  cache contents
0                 miss          0, (4)    --
12                miss          0, (4)    (8), 12
4                 hit           0, 4      (8), 12
8                 hit           0, 4      8, 12

32.2.14 (Brainstorm) Spatial locality Here is a series of address references (byte addresses):

8, 12, 44, 64, 84, 52, 256, 192, 76, 44, 12, 88, 16, 108, 24, 44

Assume that a fully associative cache with four two-word blocks is initially empty, and uses a least-recently-used replacement policy. Label each reference in the list as a hit or a miss, and show the final contents of the cache. 32.2.15 (Display page) Bigger blocks Bigger blocks

This multi-word block scheme can be extended to four-word blocks, as in the diagram below, and beyond. Only 28 bits are needed for the tag for a four-word block. The words in each block are those for which addr/16 is the same.

32.2.16 (Brainstorm) Spatial locality in MARS Change the configuration of your simulated cache to have four four-word blocks, rather than sixteen one-word blocks. It still has a total size of 64 bytes. Run cache.s again and observe the hit rate. Why is it so much higher? Does it matter if you set the rep count back to 1? 32.2.17 (Brainstorm) Spatial locality example Here is a series of address references (byte addresses):

8, 12, 44, 64, 84, 52, 256, 192, 76, 44, 12, 88, 16, 108, 24, 44

Assume that a fully associative cache with two four-word blocks is initially empty, and uses a least-recently-used replacement policy. Label each reference in the list as a hit or a miss, and show the final contents of the cache. 32.2.18 (Display page) Lead-in to next lab Lead-in to next lab

In the next lab, we'll consider two other ways to organize a cache, one called "direct-mapped" caching, the other "N-way set associative" caching. Stay tuned! 32.3 Homework(1 step) 32.3.1 (Display page) Project 3 checkoff Homework

The last project 3 checkoff will be Thursday and Friday. For full credit, you should be able to run entire programs. 33 More on caches 2009-4-16 ~ 2009-4-17 (4 activities) 33.1 The last project 3 checkoff(1 step) 33.1.1 (Display page) Checkoff requirements Checkoff requirements

To earn full credit (6 points) in this milestone, you should be able to execute sequences of instructions (2 points), plus jumps (2 points) and branches (2 points). 33.2 Other cache strategies(19 steps) 33.2.1 (Display page) Direct mapping Direct-mapped caches

Last lab, we discussed fully associative caches, in which a particular tag/data pair can be anywhere in the cache. We've already noted that the circuitry needed to find something in the cache is complicated. In a direct-mapped cache, each memory address is associated with exactly one location within the cache. This is inflexible but easy to implement in hardware. The block location in the cache is determined by bits in the address. With an eight-word direct-mapped cache, the last two bits in the address specify a position within the word. The next three bits in the address (counting from the right) form an index to one of the eight cache locations, as shown in P&H Figure 5.5

(7.5 in the third edition) below. The remainder is the tag with which a cache entry is identified. The figure below shows how the bits are classified in an eight-word direct-mapped cache: "t" is part of the tag, "i" is part of the index, and "p" is part of the position within the word.

tttttttttttttttttttttttttttiiipp

To see if a given address is in the cache, we isolate bits 4-2 of the address (call them the index) and then compare the tag in the index'th entry with the tag from the address we're looking for. Here is a short example that uses a four-word cache, so bits 3 and 2 are used to index it. Cache slots are listed in index order, 0 through 3.

address accessed  hit or miss?  cache contents (slots 0-3)
4 (00100)         miss          --      M(4)    --      --
8 (01000)         miss          --      M(4)    M(8)    --
16 (10000)        miss          M(16)   M(4)    M(8)    --
24 (11000)        miss          M(16)   M(4)    M(24)   --
8 (01000)         miss          M(16)   M(4)    M(8)    --
12 (01100)        miss          M(16)   M(4)    M(8)    M(12)
16 (10000)        hit           M(16)   M(4)    M(8)    M(12)
12 (01100)        hit           M(16)   M(4)    M(8)    M(12)

33.2.2 (Display page) Multi-word blocks (again) Multi-word blocks (again)

Like a fully associative cache, a direct-mapped cache can be used with multi-word blocks. The rightmost bits in an address then specify a position within the block. If there are two words per block, there are three position bits; four words per block require four position bits; and so on. The next bits specify an index within the cache. The remainder are the tag as before. Here's a picture of an eight-word direct-mapped cache with a two-word block size, followed by the corresponding use of the bits

within an address.

tttttttttttttttttttttttttttiippp 33.2.3 (Brainstorm) Analysis of direct-mapped caches Here is a series of address references (byte addresses):

8, 12, 44, 64, 84, 52, 256, 192, 76, 44, 12, 88, 16, 108, 24, 44 Assume that a direct-mapped cache with eight words and two words per block is initially empty. Label each reference in the list as a hit or a miss, and show the final contents of the cache. 33.2.4 (Display page) Direct-mapped caches in MARS Direct-mapped caches in MARS

Change the placement policy in the MARS data cache performance tool to Direct Mapping. Run cache.s and compare the hit rates with a direct mapped and fully associative cache. Can you explain what you observe? Now let's modify the program slightly to change the address stream that it generates. Add a variable to the beginning of the data segment.

.data
counter: .word 0
array:   .space 2048

Add an instruction outside the loop to set up a pointer to this variable.

la $t2, counter

Also add an instruction in the inner loop to store in this variable.

addu $s4,$s4,$s1
addu $s5,$s5,$s1
sw   $s3, 0($t2)
blt  $s4,$s3,innerloop

Set a break point or slow down the simulation rate so that you can see the effect of mixing stores to counter with stepping through the array. In the next step, compare the hit rates with associative and direct-mapped caches on this program.

A hybrid organization that combines features of direct-mapped and fully associative caches is a set-associative cache. An N-way set-associative cache is like a direct-mapped cache, each of whose


elements is a fully associative cache of N elements. Thus the cache consists of a number of sets, each of which consists of N blocks. Each block in the memory maps to a unique set in the cache given by the index field, and a block can be placed in any element of that set. The figures below portray a two-way set-associative cache and a four-way set-associative cache, both with a total of eight words.

2-way set-associative cache    4-way set-associative cache

Here's a short example of the use of a two-way set-associative cache with four words. Bit 2 is used to choose the set; each row lists set 0's two blocks, then set 1's.

address accessed  hit or miss?  set 0            set 1
4 (00100)         miss          --,    --        M(4),  --
8 (01000)         miss          M(8),  --        M(4),  --
16 (10000)        miss          M(8),  M(16)     M(4),  --
24 (11000)        miss          M(24), M(16)     M(4),  --
8 (01000)         miss          M(24), M(8)      M(4),  --
12 (01100)        miss          M(24), M(8)      M(4),  M(12)
16 (10000)        miss          M(16), M(8)      M(4),  M(12)
12 (01100)        hit           M(16), M(8)      M(4),  M(12)

33.2.7 (Brainstorm) Set associative analysis Here is a series of address references (byte addresses):

8, 12, 44, 64, 84, 52, 256, 192, 76, 44, 12, 88, 16, 108, 24, 44

Assume that a 2-way set associative cache with eight words total is initially empty, and uses a least-recently-used replacement policy. Label each reference in the list as a hit or a miss, and show the final contents of the cache. 33.2.8 (Brainstorm) Two-way set-associative in MARS Modify the set size to be 2 blocks and run your modified version of cache.s (the version with the counter). How does the hit rate of the two-way set associative cache compare to those of fully associative and direct mapped caches of the same size? What do you think is the best cost-performance tradeoff? 33.2.9 (Display page) Reasoning from cache behavior to cache properties Reasoning from cache behavior to cache properties

The next few activities involve some experimentation. We give you the desired behavior of the cache, and you configure the original version of cache.s to produce that behavior. 33.2.10 (Brainstorm) Experiment 1 Using the original version of cache.s, find a block size that, with an array size of 1024 and a step size of 4, produces a hit rate of 96.875% (31 hits out of 32 accesses). 33.2.11 (Brainstorm) Experiment 2 Using an array size of 1024, find a value for stepsize for which a 2-way set associative cache produces a better hit rate than a direct-mapped cache of the same size. Explain your solution. Also explain whether or not there is a step size for which a 2-way set associative cache produces a worse hit rate than a direct-mapped cache of the same size. 33.2.12 (Brainstorm) Experiment 3 Using a value of 4 for repcount, find the cache parameter values that produce the hit rates in the table below:

What is the cache size in bytes?
What is the block size in bytes?
What is the allocation policy? (Direct mapped, fully associative, or set associative? If N-way set associative, what is N?)

Briefly explain how you determined these parameters.

                          step size
array size      4        8       16      32      64     128     256
64           93.75%   87.5%   87.5%   87.5%   87.5%
128          93.75%   87.5%   87.5%   87.5%   87.5%   87.5%
256          93.75%   87.5%   87.5%   87.5%   87.5%   87.5%   87.5%
512             75%     50%     50%     50%     50%   87.5%   87.5%
1024            75%     50%     50%     50%     50%     50%   87.5%
2048            75%     50%     50%     50%     50%     50%     50%

33.2.13 (Brainstorm) Comparison of cache types Associativity usually improves the miss ratio, but not always. Give a short series of address references for which a two-way set-associative cache with LRU replacement would experience more misses than a direct-mapped cache of the same size. Briefly explain how you found your answer. 33.2.14 (Display page) Cache misses Cache misses

When a certain block cannot be found in the cache, we have a cache miss. All cache misses can be classified into one of three categories through a model known as "the three Cs":

Compulsory misses: These occur on the first access to a block that has never been in the cache before.
Capacity misses: These occur when the execution of a program requires more blocks than the cache can store; blocks are replaced for lack of space and then retrieved again.
Conflict misses: In set-associative and direct-mapped caches, multiple blocks may compete for the same set. Blocks can go anywhere in a fully associative cache, so full associativity eliminates this type of miss.

33.2.15 (Brainstorm) Effect of changing cache parameters What kinds of cache misses (compulsory, conflict, or capacity) are affected by increasing the associativity of a cache? Briefly explain your answer. 33.2.16 (Brainstorm) Effect of reducing memory use? What kinds of misses (compulsory, capacity, or conflict) are likely to be reduced when a program is rewritten so as to require less memory? Briefly explain your answer. 33.2.17 (Display page) Multi-level caches Multilevel caches

When caches first became popular, access to memory took only around 10 processor clock cycles. Since then, however, processor speed has far outstripped the time to access main memory, which is now on the order of 200 processor clock cycles. Secondary (level 2) and sometimes tertiary (level 3) caches were introduced to reduce memory access time. These fit straightforwardly into the memory hierarchy. To find the contents at a given address, the top-level cache is searched first. On a miss, we search the secondary cache, which is bigger and somewhat slower. On a miss there, we search the tertiary cache, and finally access main memory. Performance is analyzed as follows. With two caches to worry about, call them "L1" and "L2", each has its own hit time, miss rate, and miss penalty. The diagram below shows the relationships among them.

The average access time is then L1 hit time + L1 miss rate * L1 miss penalty, where the L1 miss penalty is the L2 hit time + L2 miss rate * L2 miss penalty. Combining these equations, the average access time is

L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * L2 miss penalty)

Let's plug in some numbers. Suppose that access to the primary cache takes 2ns, access to the secondary cache takes 5ns, and access to main memory takes 100ns, including all the miss handling. Also assume that the miss rate for the primary cache is 2% and the miss rate for the

secondary cache is 0.5%. Then the average access time is 2ns + .02 * (5ns + .005 * 100ns) = 2ns + 0.1ns + .01ns = 2.11ns. 33.2.18 (Brainstorm) Desired cache properties? Which of the following is generally true about a design with multiple levels of cache?

a. First-level caches are more concerned with hit time, and second-level caches are more concerned about miss rate. b. First-level caches are more concerned with miss rate, and second-level caches are more concerned about hit time.

Briefly defend your answer. 33.2.19 (Display page) A bit of review A bit of review

Here's a review of the types of caches you learned about in this lesson:

33.3 Relevance of caches to real programs(4 steps) 33.3.1 (Display page) Cache-enlightened programming Cache-enlightened programming

Complete the cache exercises from the previous lab before you start on these. P&H on pages 489-490 (507-508 in the third edition) compare the performance of radix sort and Quicksort. In big-Oh terms, radix sort is better than Quicksort (even if the latter's worst case is avoided). Figure 5.18a (below; 7.18 in the third edition) illustrates this algorithmic advantage by comparing the number of instructions executed by radix sort and Quicksort. In Figure 5.18b, however, which plots the average number of clock cycles needed for the two routines, we see that Quicksort outperforms radix sort, even for large arrays. The reason is shown in Figure 5.18c: Quicksort makes much better use of the

cache. 33.3.2 (Brainstorm) Cache-enlightened matrix initialization Consider the following two versions of code to initialize a two-dimensional array.

Version 1:

int array[256][256];
...
for (row=0; row<256; row++) {
  for (col=0; col<256; col++) {
    array[row][col] = row * col;
  }
}

Version 2:

int array[256][256];
...
for (col=0; col<256; col++) {
  for (row=0; row<256; row++) {
    array[row][col] = row * col;
  }
}

Which version makes best use of the cache? Does it matter whether the cache is direct-mapped or associative? Briefly explain your answer. 33.3.3 (Brainstorm) Cache-enlightened matrix multiplication The files ~cs61cl/code/mmult1.c and mmult2.c contain programs to initialize, then multiply some arrays whose row and column count is specified on the command line. Compile these programs into executable files named mmult1 and mmult2. Then run each of them using the UNIX time command, for example,

time mmult1 128

to time the matrix operations on 128-by-128 arrays. Determine by experiment the smallest argument that produces at least a 10% speedup of one program over the other, and report that argument. Also describe the organization of array accesses that causes the slower program to perform badly. 33.3.4 (Brainstorm) Cache-enlightened data structure design In most cases, programs using arrays will be more likely than programs using linked lists to benefit from the cache, since linked lists can have their elements scattered throughout memory. Suggest a couple of applications of linked lists where this would not necessarily be the case, that is, where a linked list implementation of a data type would make almost as good use of a cache as an array implementation. Briefly explain your answer. 33.4 Homework and reading(2 steps) 33.4.1 (Display page) Upcoming topics Upcoming topics

Activities in the next lab will introduce the concept of virtual memory, covered in P&H sections 5.4 and 5.5 (fourth edition) or 7.4 and 7.5 (third edition). 33.4.2 (Display page) Homework Homework

Please finish any of the lab activities from this week that you haven't yet gotten around to. The due date for submission of Project 3 is Monday evening, April 20, at 11:59pm. Submit your circuit online, using the submit command. Homework assignment 9 (the last homework) and project 4 (which you


may do with a partner) will appear over the weekend. 34 More on caches-A 2009-4-16 ~ 2009-4-17 (4 activities) 34.1 The last project 3 checkoff(1 step) 34.1.1 (Display page) Checkoff requirements Checkoff requirements

To earn full credit (6 points) in this milestone, you should be able to execute sequences of instructions (2 points), plus jumps (2 points) and branches (2 points). 34.2 Other cache strategies(19 steps) 34.2.1 (Display page) A preview of the lesson A preview of the lesson

Here's a preview of the types of caches you'll learn about in this lesson:

34.2.2 (Display page) Direct mapping Direct-mapped caches

Last lab, we discussed fully associative caches, in which a particular tag/data pair can be anywhere in the cache. We've already noted that the circuitry needed to find something in the cache is complicated. In a direct-mapped cache, each memory address is associated with exactly one location within the cache. This is inflexible but easy to implement in hardware. The block location in the cache is determined by bits in the address. With an eight-word direct-mapped cache, the last two bits in the address specify a position within the word. The next three bits in the address (counting from the right) form an index to one of the eight cache locations, as shown in P&H Figure 5.5


(7.5 in the third edition) below. The remainder is the tag with which a cache entry is identified. The figure below shows how the bits are classified in an eight-word direct-mapped cache: "t" is part of the tag, "i" is part of the index, and "p" is part of the position within the word.

tttttttttttttttttttttttttttiiipp

To see if a given address is in the cache, we isolate bits 4-2 of the address (call them the index) and then compare the tag in the index'th entry with the tag from the address we're looking for. Here is a short example that uses a four-word cache, so bits 3 and 2 are used to index it. Cache slots are listed in index order, 0 through 3.

address accessed  hit or miss?  cache contents (slots 0-3)
4 (00100)         miss          --      M(4)    --      --
8 (01000)         miss          --      M(4)    M(8)    --
16 (10000)        miss          M(16)   M(4)    M(8)    --
24 (11000)        miss          M(16)   M(4)    M(24)   --
8 (01000)         miss          M(16)   M(4)    M(8)    --
12 (01100)        miss          M(16)   M(4)    M(8)    M(12)
16 (10000)        hit           M(16)   M(4)    M(8)    M(12)
12 (01100)        hit           M(16)   M(4)    M(8)    M(12)

34.2.3 (Display page) Multi-word blocks (again) Multi-word blocks (again)

Like a fully associative cache, a direct-mapped cache can be used with multi-word blocks. The rightmost bits in an address then specify a position within the block. If there are two words per block, there are three position bits; four words per block require four position bits; and so on. The next bits specify an index within the cache. The remainder are the tag as before. Here's a picture of an eight-word direct-mapped cache with a two-word block size, followed by the corresponding use of the bits


within an address.

tttttttttttttttttttttttttttiippp 34.2.4 (Brainstorm) Analysis of direct-mapped caches Here is a series of address references (byte addresses):

8, 12, 44, 64, 84, 52, 256, 192, 76, 44, 12, 88, 16, 108, 24, 44 Assume that a direct-mapped cache with eight words and two words per block is initially empty. Label each reference in the list as a hit or a miss, and show the final contents of the cache. 34.2.5 (Display page) Direct-mapped caches in MARS Direct-mapped caches in MARS

Change the placement policy in the MARS data cache performance tool to Direct Mapping. Run cache.s and compare the hit rates with a direct mapped and fully associative cache. Can you explain what you observe? Now let's modify the program slightly to change the address stream that it generates. Add a variable to the beginning of the data segment.

.data
counter: .word 0
array:   .space 2048

Add an instruction outside the loop to set up a pointer to this variable.

la $t2, counter

Also add an instruction in the inner loop to store in this variable.

addu $s4,$s4,$s1
addu $s5,$s5,$s1
sw   $s3, 0($t2)
blt  $s4,$s3,innerloop

Set a break point or slow down the simulation rate so that you can see the effect of mixing stores to counter with stepping through the array. In the next step, compare the hit rates with associative and direct-mapped caches on this program. 34.2.6 (Brainstorm) Comparison of cache behavior Compare the hit rates of associative and direct-mapped caches on the modified cache.s, and briefly explain the differences. 34.2.7 (Display page) Set-associative caches Set-associative caches

A hybrid organization that combines features of direct-mapped and fully associative caches is a set-associative cache. An N-way set-associative cache is like a direct-mapped cache, each of whose elements is a fully associative cache of N elements. Thus the cache consists of a number of sets, each of which consists of N blocks. Each block in the memory maps to a unique set in the cache given by the index field, and a block can be placed in any element of that set. The figures below portray a two-way set-associative cache and a four-way set-associative cache, both with a total of eight words.

2-way set-associative cache    4-way set-associative cache

Here's a short example of the use of a two-way set-associative cache with four words. Bit 2 is used to choose the set; each row lists set 0's two blocks, then set 1's.

address accessed  hit or miss?  set 0            set 1
4 (00100)         miss          --,    --        M(4),  --
8 (01000)         miss          M(8),  --        M(4),  --
16 (10000)        miss          M(8),  M(16)     M(4),  --
24 (11000)        miss          M(24), M(16)     M(4),  --
8 (01000)         miss          M(24), M(8)      M(4),  --
12 (01100)        miss          M(24), M(8)      M(4),  M(12)
16 (10000)        miss          M(16), M(8)      M(4),  M(12)
12 (01100)        hit           M(16), M(8)      M(4),  M(12)

34.2.8 (Brainstorm) Set associative analysis Here is a series of address references (byte addresses):

8, 12, 44, 64, 84, 52, 256, 192, 76, 44, 12, 88, 16, 108, 24, 44

Assume that a 2-way set associative cache with eight words total is initially empty, and uses a least-recently-used replacement policy. Label each reference in the list as a hit or a miss, and show the final contents of the cache. 34.2.9 (Brainstorm) Two-way set-associative in MARS Modify the set size to be 2 blocks and run your modified version of cache.s (the version with the counter). How does the hit rate of the two-way set associative cache compare to those of fully associative and direct mapped caches of the same size? What do you think is the best cost-performance tradeoff? 34.2.10 (Display page) Reasoning from cache behavior to cache properties Reasoning from cache behavior to cache properties

The next few activities involve some experimentation. We give you the desired behavior of the cache, and you configure the original version of cache.s to produce that behavior. 34.2.11 (Brainstorm) Experiment 1 Using the original version of cache.s, find a block size that, with an array size of 1024 and a step size of 4, produces a hit rate of 96.875% (31 hits out of 32 accesses). 34.2.12 (Brainstorm) Experiment 2 Using an array size of 1024, find a value for stepsize for which a 2-way set associative cache produces a better hit rate than a direct-mapped cache of the same size. Explain your solution. Also explain whether or not there is a step size for which a 2-way set associative cache produces a worse hit rate than a direct-mapped cache of the same size. 34.2.13 (Brainstorm) Experiment 3 Using a value of 4 for repcount, find the cache parameter values that produce the hit rates in the table below:

What is the cache size in bytes?
What is the block size in bytes?
What is the allocation policy? (Direct mapped, fully associative, or set associative? If N-way set associative, what is N?)

Briefly explain how you determined these parameters.

                          step size
array size      4        8       16      32      64     128     256
64           93.75%   87.5%   87.5%   87.5%   87.5%
128          93.75%   87.5%   87.5%   87.5%   87.5%   87.5%
256          93.75%   87.5%   87.5%   87.5%   87.5%   87.5%   87.5%
512             75%     50%     50%     50%     50%   87.5%   87.5%
1024            75%     50%     50%     50%     50%     50%   87.5%
2048            75%     50%     50%     50%     50%     50%     50%

34.2.14 (Brainstorm) Comparison of cache types Associativity usually improves the miss ratio, but not always. Give a short series of address references for which a two-way set-associative cache with LRU replacement would experience more misses than a direct-mapped cache of the same size. Briefly explain how you found your answer. 34.2.15 (Display page) Cache misses Cache misses

Here's an illustration to explain the different types of cache misses:


Image courtesy of CS Illustrated

34.2.16 (Brainstorm) Effect of changing cache parameters What kinds of cache misses (compulsory, conflict, or capacity) are affected by increasing the associativity of a cache? Briefly explain your answer. 34.2.17 (Brainstorm) Effect of reducing memory use? What kinds of misses (compulsory, capacity, or conflict) are likely to be reduced when a program is rewritten so as to require less memory? Briefly explain your answer. 34.2.18 (Display page) Multi-level caches Multilevel caches

When caches first became popular, access to memory took only around 10 processor clock cycles. Since then, however, processor speed has far outstripped the time to access main memory, which is now on the order of 200 processor clock cycles. Secondary (level 2) and sometimes tertiary (level 3) caches were introduced to reduce memory access time. These fit straightforwardly into the memory hierarchy. To find the contents at a given address, the top-level cache is searched first. On a miss, we search the secondary cache, which is bigger and somewhat slower. On a miss there, we search the tertiary cache, and finally access main memory. Performance is analyzed as follows. With two caches to worry about, call them "L1" and "L2", each has its own hit time, miss rate, and miss penalty. The diagram below shows the relationships among them.


The average access time is then L1 hit time + L1 miss rate * L1 miss penalty, where the L1 miss penalty is the L2 hit time + L2 miss rate * L2 miss penalty. Combining these equations, the average access time is

L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * L2 miss penalty)

Let's plug in some numbers. Suppose that access to the primary cache takes 2ns, access to the secondary cache takes 5ns, and access to main memory takes 100ns, including all the miss handling. Also assume that the miss rate for the primary cache is 2% and the miss rate for the secondary cache is 0.5%. Then the average access time is 2ns + .02 * (5ns + .005 * 100ns) = 2ns + 0.1ns + .01ns = 2.11ns. 34.2.19 (Brainstorm) Desired cache properties? Which of the following is generally true about a design with multiple levels of cache?

a. First-level caches are more concerned with hit time, and second-level caches are more concerned about miss rate. b. First-level caches are more concerned with miss rate, and second-level caches are more concerned about hit time.

Briefly defend your answer. 34.3 Relevance of caches to real programs(4 steps) 34.3.1 (Display page) Cache-enlightened programming Cache-enlightened programming

Complete the cache exercises from the previous lab before you start on these. P&H on pages 489-490 (507-508 in the third edition) compare the performance of radix sort and Quicksort. In big-Oh terms, radix sort is better than Quicksort (even if the latter's worst case is avoided). Figure 5.18a (below; 7.18 in the third edition) illustrates this algorithmic advantage by comparing the number of instructions executed by radix sort and Quicksort. In Figure 5.18b, however, which plots the average number of clock cycles needed for the two routines, we see that Quicksort outperforms radix sort, even for large arrays. The reason is shown in Figure 5.18c—Quicksort makes much better use of the


cache. 34.3.2 (Brainstorm) Cache-enlightened matrix initialization Consider the following two versions of code to initialize a two-dimensional array.

Version 1:
    int array[256][256];
    ...
    for (row = 0; row < 256; row++) {
        for (col = 0; col < 256; col++) {
            array[row][col] = row * col;
        }
    }

Version 2:
    int array[256][256];
    ...
    for (col = 0; col < 256; col++) {
        for (row = 0; row < 256; row++) {
            array[row][col] = row * col;
        }
    }

Which version makes better use of the cache? Does it matter whether the cache is direct-mapped or associative? Briefly explain your answer. 34.3.3 (Brainstorm) Cache-enlightened matrix multiplication The files ~cs61cl/code/mmult1.c and mmult2.c contain programs to initialize, then multiply, some arrays whose row and column count is specified on the command line. Compile these programs into executable files named mmult1 and mmult2. Then run each of them using the UNIX time command, for example,

    time mmult1 128

to time the matrix operations on 128-by-128 arrays. Determine by experiment the smallest argument that produces a speedup of at least 10% of one program over the other, and report that argument. Also describe the organization of array accesses that causes the slower program to perform badly. 34.3.4 (Brainstorm) Cache-enlightened data structure design In most cases, programs using arrays will be more likely than programs using linked lists to benefit from the cache, since linked lists can have their elements scattered throughout memory. Suggest a couple of applications of linked lists where this would not necessarily be the case, that is, where a linked list implementation of a data type would make almost as good use of a cache as an array implementation. Briefly explain your answer. 34.4 Homework and reading(2 steps) 34.4.1 (Display page) Upcoming topics Upcoming topics

Activities in the next lab will introduce the concept of virtual memory, covered in P&H sections 5.4 and 5.5 (fourth edition) or 7.4 and 7.5 (third edition). 34.4.2 (Display page) Homework Homework

Please finish any of the lab activities from this week that you haven't yet gotten around to. The due date for submission of Project 3 is Monday evening, April 20, at 11:59pm. Submit your circuit online, using the submit command. Homework assignment 9 (the last homework) and project 4 (which you


may do with a partner) will appear over the weekend. 35 Virtual memory 2009-4-21 ~ 2009-4-22 (4 activities) 35.2 A Recap of Caches(2 steps) 35.2.1 (Display page) The Purpose of Caching Summary: An illustration reviewing why we use caches 35.2.2 (Display page) Types of Cache Misses Summary: An illustration reviewing the types of cache misses 35.3 Explore virtual-to-physical address translation.(14 steps) 35.3.1 (Display page) Illusions Illusions

Today is for dispelling illusions.

Illusion 1: You're the only user on the computer. Illusion 2: Your programs may access any address in a 32-bit address space. Your programs always start at the same address. Memory in the text segment is contiguous, as is memory in the heap and in the stack.

Lies. All lies. Actually, the illusion of being the only user on the computer is easy to dispel. Just type "who" to a UNIX shell, or run Activity Monitor on a Mac OS X system, or run Process Explorer on Windows. A lot of things are running at once, even on a "personal" computer. But how is "running at once" supported? The reality of memory is this: a computer might not have enough installed physical memory to cover a 32-bit address space (4 gigabytes of memory would be needed). Where a program is actually situated in physical memory will change from time to time. And a million-byte array, which to the user looks like an unbroken sequence of contiguous cells, is really broken up and scattered through physical memory and disk. What the user sees is virtual memory, the basis of the illusions mentioned above. It fits into the memory hierarchy between memory and disk, as shown below. The operating system and the hardware in tandem maintain a set of pages that collectively represent the memory used by each program. When the total size of all the pages in use by processes on the system exceeds the amount of available physical memory, some pages are stored on disk—"swapped out". Eventually a program will want to access something on a swapped-out page; it must then be "swapped in", and some other page chosen to be written to disk. This organization is pictured below. On the left and right of the diagram are user views of two programs. In the middle are the actual pages for these programs, some currently in physical memory, others swapped out to disk.


What this system requires is a method for finding the page corresponding to a given virtual address, and translating the address to a physical address. We discuss that next. 35.3.2 (Display page) Address translation Address translation

Virtual memory space is divided into pages of size 2^k; typical page sizes are 4K or 8K. The OS and the hardware in tandem map virtual addresses into physical addresses via a page table. A simple model of a page table is a giant array whose elements are physical addresses of pages and associated bookkeeping information. The page table is always in memory, and is accessed through a hardware page table register. In the simple model, the mapping goes like this. The virtual address is split into two pieces, a virtual page number and a page offset. The virtual page number is then used as an index into the page table, as shown in the diagrams below. Contained in the indexed element is the physical page number, which for a page in memory is combined with the page offset to produce a physical address. If the page is not in memory, we have a page fault. The general procedure:

An example:


A page table entry also contains several kinds of status information:

a "valid?" bit—is the page in physical memory?; a "writable?" bit—some information in memory can be read but not written; other access permissions; a "has been changed?" bit, which saves unnecessary writing to disk; and a "has been accessed recently?" bit, which is used to help determine which page currently in physical memory is to be swapped out.

35.3.3 (Self Test) Bit counts? For both these questions, assume that a virtual address has 40 bits. If each page contains 16 kilobytes, how many bits in the virtual address specify the page offset?

How many entries does the page table have? (Express your answer as 2^x, where x is some integer.)

35.3.4 (Brainstorm) Similarity to caching? In what ways is memory management using virtual memory similar to caching? In what ways do they differ? 35.3.5 (Display page) More details More details

Each active process has its own page table. To start a process, the operating system finds some empty pages (this may require swapping out some other pages) and copies the executable text and data segments into them. It also adds corresponding entries to the process's page table. It loads registers $a0 with a pointer to the command line arguments, $ra with the address to return to when the process terminates, $sp with the stack pointer, and $gp with a pointer to the global data area. Then it jumps to the start of the process. Processes may get suspended so that other processes can get their turn to execute. Stored with the page table is information about the state of the process: the program counter and contents of the registers. Suspending a process means saving its registers and PC, then choosing another process to execute, and substituting the new process's page table for the old. To resume a process, we restore the registers, substitute the new process's page table, and jump to the saved PC value. Growing a process means asking for more pages. This is done through the operating system. Growing stack space is easy; growing heap space is done through the sbrk system call, as we observed early in this course. 35.3.6 (Display page) Problems with the simple model Problems with the simple model

The simple model doesn't tell the whole story. One drawback is memory use. We saw in an earlier self-test question an example of a page table with 2^28 entries—250 megabytes! That doesn't leave a whole lot of space for user programs. A second problem is time. Each memory access, assuming it's uncached, is turned into two accesses, one to access the page table and the other to access the desired memory. And of course if there's a page fault, the access time increases by orders of magnitude. One solution to the space problem is to have a multi-level page table as shown below.


Another is to access the page table with a virtual address rather than a physical address. Both these alternatives, though addressing the problem of large page tables, make the speed problem even worse by adding yet another memory access to each uncached store or load. A solution to the speed problem is (unsurprisingly) a cache. It's called a translation-lookaside buffer, usually abbreviated to "TLB". (The name was chosen before caches were in wide use.) The TLB is a relatively small cache that keeps track of virtual address/physical address pairs. On the MIPS, it's either fully associative or almost fully associative. P&H quote some typical values:

TLB size = 16-512 entries
block size = 1-2 page table entries (typically 4-8 bytes each)
hit time = 0.5-1 clock cycle
miss penalty = 10-100 clock cycles
miss rate = 0.01%-1%

P&H Figure 7.23 illustrates the TLB. 35.3.7 (Brainstorm) TLB organization? Why is the TLB fully associative rather than direct-mapped? 35.3.8 (Brainstorm) TLB size? We have seen in earlier discussion about caches that the hardware for a fully associative cache is expensive and takes up a lot of space on the chip. That's one reason for keeping the TLB small. Give another reason, relating to the contents of the TLB, that a small cache might still perform acceptably. 35.3.9 (Brainstorm) Tracing a memory access: what gets cached? We will now trace through a memory access, say, a lw. We have the effective address—what happens next? We will either consult the cache, if the cache is storing virtual addresses, or we will send the address through the virtual-to-physical address translation process. Caching virtual address/contents pairs ought to be faster than caching physical address/contents pairs, because the address translation is skipped. Give a couple of reasons for preferring to cache physical addresses. 35.3.10 (Display page) Tracing a memory access, continued Tracing a memory access, continued

Let's assume that we're caching physical addresses rather than virtual addresses. That means that we first consult the TLB to see if it contains a virtual page number/physical page number pair. If so, we construct the physical address and check the cache. If what we're looking for is there, we win big. If not, we stall while we fetch the requested contents into the cache. If the page number is not in the TLB, we have to check the page table to see if the page is in memory. If so, we put the virtual page number/physical page number pair into the TLB, probably evicting something that was there already. Then we continue with cache access as above. (Probability of a cache hit with a TLB miss is low but not zero.) Some errors are detected at this point, all falling under the category "segmentation faults":

the virtual page number is outside the program's limits; the page is protected, e.g. accessible only by the operating system; a store operation is being processed, but the page isn't writable (e.g. constant string storage in C).

(Compare these errors with bus errors, which involve addresses that the hardware can't deal with, e.g. alignment errors.) If the page is not in memory, we have a page fault. Dealing with this situation is done entirely in software. It involves identifying a page to be swapped out, then swapping in the page that contains the requested memory. This takes a long time; access to disk is roughly 100,000 times slower than access to main memory. (See CS 162 for more information.) Figures 5.24 and 5.25 in P&H (7.24 and 7.25 in the third edition), reproduced below, supply more details for the process of accessing and updating the TLB. On the MIPS, there is hardware support for some TLB operations; P&H note, however, that hardware support won't provide a big performance advantage, since the basic operations are the same in either case.

35.3.11 (Brainstorm) Tracing a memory access, part 3 Consider a system with a virtual address of 20 bits, a physical address of 16 bits, a page size of 4K bytes (1K words), a direct-mapped cache of 256 bytes (64 words) with 4-word blocks, and a 4-entry TLB. Assume that the current data store in this system is as follows:

TLB:
      Virtual page #   Physical page #
    ------------------------------------
  0 |      0x1a      |       0x2       |
  1 |      0x01      |       0x8       |
  2 |      0x05      |       0xb       |
  3 |      0x2f      |       0x0       |
    ------------------------------------

Cache:
      Valid   Tag    Word #0  Word #1  Word #2  Word #3
    ----------------------------------------------------
  0 |   1  | 0x20 |  0x10  |  0x20  |  0x30  |  0x40  |
  1 |   1  | 0x08 |  0x11  |  0x22  |  0x33  |  0x44  |
  2 |   0  |      |        |        |        |        |
  3 |   0  |      |        |        |        |        |
  4 |   1  | 0xb6 |  0x90  |  0xa0  |  0xb0  |  0xc0  |
  5 |   0  |      |        |        |        |        |
  6 |   0  |      |        |        |        |        |
  7 |   0  |      |        |        |        |        |
  8 |   1  | 0xb5 |  0x50  |  0x60  |  0x70  |  0x80  |
  9 |   0  |      |        |        |        |        |
  a |   0  |      |        |        |        |        |
  b |   0  |      |        |        |        |        |
  c |   0  |      |        |        |        |        |
  d |   0  |      |        |        |        |        |
  e |   0  |      |        |        |        |        |
  f |   1  | 0x81 |  0x55  |  0x66  |  0x77  |  0x88  |
    ----------------------------------------------------

What is the data contained in virtual address 0x05584? Explain how you determined it. 35.3.12 (Brainstorm) Unmapped addresses? There are areas in memory that are unmapped, that is, all references to these areas are assumed to be only with physical addresses. Why would one want to disable address mapping? (One situation would arise when the computer is first turned on.) 35.3.13 (Brainstorm) Event probability? Rank the following events from most likely to least likely. Justify your answer.

a. TLB miss, page table hit, cache hit b. TLB miss, page table hit, cache miss c. TLB miss, page table miss, cache miss

35.3.14 (Display page) Virtual memory compared to cache Virtual memory compared to cache

P&H devote section 5.5 (7.5 in the third edition) to a unified view of these two levels in the memory hierarchy. Both virtual memory and caches need to deal with several questions:

Where can a block be placed? How is a block found? Which block should be replaced on a cache miss? What happens on a write?

The table below lists answers to these questions for typical cache and virtual memory systems. (You should be clear on the reasons for the differences.)

    cache                                  virtual memory
    block or line                          page
    miss                                   page fault
    miss is handled in hardware            miss is handled in software
    block size = 32-64 bytes               page size = 4-8 KB
    placement = direct-mapped or           placement = fully associative
      N-way set associative
    LRU or random                          LRU
    write-through or write-back            write-back

35.4 Homework and readings(1 step) 35.4.1 (Display page) Upcoming material Upcoming material

The next couple of labs will cover the K&R implementation of malloc and free, found in section 8.7. 36 Virtual memory-A 2009-4-21 ~ 2009-4-22 (4 activities) 36.2 A Recap of Caches(2 steps) 36.2.1 (Display page) The Purpose of Caching The Purpose of Caching

Here's an analogy for how the memory hierarchy is used. You're writing a term paper—the "processor"—at a table in your dorm. Doe Library is equivalent to a disk; it has essentially limitless capacity, but it takes a long time to retrieve a book. You have several good books on your bookshelf. The bookshelf is like main memory. Its capacity is smaller, which means you have to return a book when the shelf fills up. However, it's easier and faster to find a book there than to go to the library. The open books on the table are like the cache. Capacity is much less—an open book takes up more space than a closed one on the shelf, and you may have to shelve a book when too many others are open—but it's much faster to retrieve data. This organization creates the illusion of the whole library open on the tabletop: we keep as many recently used books as possible open on the table, since we're likely to use them again; we keep as many books as Doe will allow on our shelf, since it's faster to access them than to go to the library stacks. 36.2.2 (Display page) Types of Cache Misses

Types of Cache Misses

When a certain block cannot be found in the cache, we have a cache miss. All cache misses can be classified into one of three categories through a model known as "the three Cs":

Compulsory misses: These occur on the first access to a block that has never been in the cache. Capacity misses: When a program's execution requires more blocks than the cache can store, blocks are replaced for lack of space and later retrieved again; the resulting misses are capacity misses. Conflict misses: In set-associative and direct-mapped caches, multiple blocks may compete for the same set. Blocks can go anywhere in a fully associative cache, so full associativity eliminates this type of miss.

36.3 Explore virtual-to-physical address translation.(14 steps) 36.3.1 (Display page) Illusions Illusions

Today is for dispelling illusions.

Illusion 1: You're the only user on the computer. Illusion 2: Your programs may access any address in a 32-bit address space. Your programs always start at the same address. Memory in the text segment is contiguous, as is memory in the heap and in the stack.

Lies. All lies. Actually, the illusion of being the only user on the computer is easy to dispel. Just type "who" to a UNIX shell, or run Activity Monitor on a Mac OS X system, or run Process Explorer on Windows. A lot of things are running at once, even on a "personal" computer. But how is "running at once" supported? The reality of memory is this: a computer might not have enough installed physical memory to cover a 32-bit address space (4 gigabytes of memory would be needed). Where a program is actually situated in physical memory will change from time to time. And a million-byte array, which to the user looks like an unbroken sequence of contiguous cells, is really broken up and scattered through physical memory and disk. What the user sees is virtual memory, the basis of the illusions mentioned above. It fits into the memory hierarchy between memory and disk, as shown below. The operating system and the hardware in tandem maintain a set of pages that collectively represent the memory used by each program. When the total size of all the pages in use by processes on the system exceeds the amount of available physical memory, some pages are stored on disk—"swapped out". Eventually a program will want to access something on a swapped-out page; it must then be "swapped in", and some other page chosen to be written to disk. This organization is pictured below. On the left and right of the diagram are user views of two programs. In the middle are the actual pages for these programs, some currently in physical memory, others swapped out to disk.


What this system requires is a method for finding the page corresponding to a given virtual address, and translating the address to a physical address. We discuss that next. 36.3.2 (Display page) Address translation Address translation

Virtual memory space is divided into pages of size 2^k; typical page sizes are 4K or 8K. The OS and the hardware in tandem map virtual addresses into physical addresses via a page table. A simple model of a page table is a giant array whose elements are physical addresses of pages and associated bookkeeping information. The page table is always in memory, and is accessed through a hardware page table register. In the simple model, the mapping goes like this. The virtual address is split into two pieces, a virtual page number and a page offset. The virtual page number is then used as an index into the page table, as shown in the diagrams below. Contained in the indexed element is the physical page number, which for a page in memory is combined with the page offset to produce a physical address. If the page is not in memory, we have a page fault. The general procedure:

An example:


A page table entry also contains several kinds of status information:

a "valid?" bit—is the page in physical memory?; a "writable?" bit—some information in memory can be read but not written; other access permissions; a "has been changed?" bit, which saves unnecessary writing to disk; and a "has been accessed recently?" bit, which is used to help determine which page currently in physical memory is to be swapped out.

36.3.3 (Self Test) Bit counts? For both these questions, assume that a virtual address has 40 bits. If each page contains 16 kilobytes, how many bits in the virtual address specify the page offset?

How many entries does the page table have? (Express your answer as 2^x, where x is some integer.)

36.3.4 (Brainstorm) Similarity to caching? In what ways is memory management using virtual memory similar to caching? In what ways do they differ? 36.3.5 (Display page) More details More details

Each active process has its own page table. To start a process, the operating system finds some empty pages (this may require swapping out some other pages) and copies the executable text and data segments into them. It also adds corresponding entries to the process's page table. It loads registers $a0 with a pointer to the command line arguments, $ra with the address to return to when the process terminates, $sp with the stack pointer, and $gp with a pointer to the global data area. Then it jumps to the start of the process. Processes may get suspended so that other processes can get their turn to execute. Stored with the page table is information about the state of the process: the program counter and contents of the registers. Suspending a process means saving its registers and PC, then choosing another process to execute, and substituting the new process's page table for the old. To resume a process, we restore the registers, substitute the new process's page table, and jump to the saved PC value. Growing a process means asking for more pages. This is done through the operating system. Growing stack space is easy; growing heap space is done through the sbrk system call, as we observed early in this course. 36.3.6 (Display page) Problems with the simple model Problems with the simple model

The simple model doesn't tell the whole story. One drawback is memory use. We saw in an earlier self-test question an example of a page table with 2^28 entries—250 megabytes! That doesn't leave a whole lot of space for user programs. A second problem is time. Each memory access, assuming it's uncached, is turned into two accesses, one to access the page table and the other to access the desired memory. And of course if there's a page fault, the access time increases by orders of magnitude. One solution to the space problem is to have a multi-level page table as shown below.


Another is to access the page table with a virtual address rather than a physical address. Both these alternatives, though addressing the problem of large page tables, make the speed problem even worse by adding yet another memory access to each uncached store or load. A solution to the speed problem is (unsurprisingly) a cache. It's called a translation-lookaside buffer, usually abbreviated to "TLB". (The name was chosen before caches were in wide use.) The TLB is a relatively small cache that keeps track of virtual address/physical address pairs. On the MIPS, it's either fully associative or almost fully associative. P&H quote some typical values:

TLB size = 16-512 entries
block size = 1-2 page table entries (typically 4-8 bytes each)
hit time = 0.5-1 clock cycle
miss penalty = 10-100 clock cycles
miss rate = 0.01%-1%

P&H Figure 7.23 illustrates the TLB. 36.3.7 (Brainstorm) TLB organization? Why is the TLB fully associative rather than direct-mapped? 36.3.8 (Brainstorm) TLB size? We have seen in earlier discussion about caches that the hardware for a fully associative cache is expensive and takes up a lot of space on the chip. That's one reason for keeping the TLB small. Give another reason, relating to the contents of the TLB, that a small cache might still perform acceptably. 36.3.9 (Brainstorm) Tracing a memory access: what gets cached? We will now trace through a memory access, say, a lw. We have the effective address—what happens next? We will either consult the cache, if the cache is storing virtual addresses, or we will send the address through the virtual-to-physical address translation process. Caching virtual address/contents pairs ought to be faster than caching physical address/contents pairs, because the address translation is skipped. Give a couple of reasons for preferring to cache physical addresses. 36.3.10 (Display page) Tracing a memory access, continued Tracing a memory access, continued

Let's assume that we're caching physical addresses rather than virtual addresses. That means that we first consult the TLB to see if it contains a virtual page number/physical page number pair. If so, we construct the physical address and check the cache. If what we're looking for is there, we win big. If not, we stall while we fetch the requested contents into the cache. If the page number is not in the TLB, we have to check the page table to see if the page is in memory. If so, we put the virtual page number/physical page number pair into the TLB, probably evicting something that was there already. Then we continue with cache access as above. (Probability of a cache hit with a TLB miss is low but not zero.) Some errors are detected at this point, all falling under the category "segmentation faults":

the virtual page number is outside the program's limits; the page is protected, e.g. accessible only by the operating system; a store operation is being processed, but the page isn't writable (e.g. constant string storage in C).

(Compare these errors with bus errors, which involve addresses that the hardware can't deal with, e.g. alignment errors.) If the page is not in memory, we have a page fault. Dealing with this situation is done entirely in software. It involves identifying a page to be swapped out, then swapping in the page that contains the requested memory. This takes a long time; access to disk is roughly 100,000 times slower than access to main memory. (See CS 162 for more information.) Figures 5.24 and 5.25 in P&H (7.24 and 7.25 in the third edition), reproduced below, supply more details for the process of accessing and updating the TLB. On the MIPS, there is hardware support for some TLB operations; P&H note, however, that hardware support won't provide a big performance advantage, since the basic operations are the same in either case.

36.3.11 (Brainstorm) Tracing a memory access, part 3 Consider a system with a virtual address of 20 bits, a physical address of 16 bits, a page size of 4K bytes (1K words), a direct-mapped cache of 256 bytes (64 words) with 4-word blocks, and a 4-entry TLB. Assume that the current data store in this system is as follows:

TLB:
      Virtual page #   Physical page #
    ------------------------------------
  0 |      0x1a      |       0x2       |
  1 |      0x01      |       0x8       |
  2 |      0x05      |       0xb       |
  3 |      0x2f      |       0x0       |
    ------------------------------------

Cache:
      Valid   Tag    Word #0  Word #1  Word #2  Word #3
    ----------------------------------------------------
  0 |   1  | 0x20 |  0x10  |  0x20  |  0x30  |  0x40  |
  1 |   1  | 0x08 |  0x11  |  0x22  |  0x33  |  0x44  |
  2 |   0  |      |        |        |        |        |
  3 |   0  |      |        |        |        |        |
  4 |   1  | 0xb6 |  0x90  |  0xa0  |  0xb0  |  0xc0  |
  5 |   0  |      |        |        |        |        |
  6 |   0  |      |        |        |        |        |
  7 |   0  |      |        |        |        |        |
  8 |   1  | 0xb5 |  0x50  |  0x60  |  0x70  |  0x80  |
  9 |   0  |      |        |        |        |        |
  a |   0  |      |        |        |        |        |
  b |   0  |      |        |        |        |        |
  c |   0  |      |        |        |        |        |
  d |   0  |      |        |        |        |        |
  e |   0  |      |        |        |        |        |
  f |   1  | 0x81 |  0x55  |  0x66  |  0x77  |  0x88  |
    ----------------------------------------------------

What is the data contained in virtual address 0x05584? Explain how you determined it. 36.3.12 (Brainstorm) Unmapped addresses? There are areas in memory that are unmapped, that is, all references to these areas are assumed to be only with physical addresses. Why would one want to disable address mapping? (One situation would arise when the computer is first turned on.) 36.3.13 (Brainstorm) Event probability? Rank the following events from most likely to least likely. Justify your answer.

a. TLB miss, page table hit, cache hit b. TLB miss, page table hit, cache miss c. TLB miss, page table miss, cache miss

36.3.14 (Display page) Virtual memory compared to cache Virtual memory compared to cache

P&H devote section 5.5 (7.5 in the third edition) to a unified view of these two levels in the memory hierarchy. Both virtual memory and caches need to deal with several questions:

Where can a block be placed? How is a block found? Which block should be replaced on a cache miss? What happens on a write?

The table below lists answers to these questions for typical cache and virtual memory systems. (You should be clear on the reasons for the differences.)

    cache                                  virtual memory
    block or line                          page
    miss                                   page fault
    miss is handled in hardware            miss is handled in software
    block size = 32-64 bytes               page size = 4-8 KB
    placement = direct-mapped or           placement = fully associative
      N-way set associative
    LRU or random                          LRU
    write-through or write-back            write-back

36.4 Homework and readings(1 step) 36.4.1 (Display page) Upcoming material Upcoming material

The next couple of labs will cover the K&R implementation of malloc and free, found in section 8.7. 37 Memory management: implementation of malloc and free 2009-4-23 ~ 2009-4-24 (4 activities) 37.1 Review the organization of a program's memory.(4 steps) 37.1.1 (Display page) The four kinds of memory Four kinds of memory

Memory in an executing program is split into four areas:

Program "text": the machine instructions of the program. This doesn't change while the program is running.
The stack, used for variables declared inside functions, parameters, and function return addresses. A function's stack segment (and the data it contains) goes away when the function returns.
The heap: for variables allocated with malloc. This data lives until deallocated by the programmer.
Static memory: for variables declared outside functions. This data is basically permanent for the entire program run.

http://spring09.ucwise.org/builder/builderPortal.php?BUILDER_menu=curriculumSummary Page 177 of 194 Curriculum Builder Portal 2/11/10 8:55 AM

Shown below are two diagrams showing how these areas are laid out. On the left is the layout on the MIPS architecture. On the right is the memory layout in early operating systems on the Intel 80x86 architecture.

Memory layout in the MIPS architecture (the text segment stores the code for the executing program); memory layout in early operating systems on the Intel 80x86 architecture. 37.1.2 (Display page) Heap management Heap management

The heap consists of allocated blocks and free blocks. Each block is a contiguous segment. Two operations are applied to the heap:

allocation: finding a contiguous block of the desired size, splitting it if necessary, and returning the leftover memory to the collection of free blocks
freeing: returning the block being freed to the collection of free blocks

In contrast to the stack, whose segments are contiguous in memory (adjacent, with no "holes"), the heap may contain gaps of unallocated storage. Below is a sample sequence of calls to malloc and free.

p1 = malloc (2)

p2 = malloc (1)

p3 = malloc (1)

free (p1);

free (p2);

p4 = malloc (2)

p5 = malloc (3)

p6 = malloc (3) Can't allocate! 37.1.3 (Brainstorm) Was stuckness inevitable? In the example just given, the request

p6 = malloc (3);

could not be satisfied, because of the choices that were made to satisfy previous malloc requests. Explain how all the malloc requests could be satisfied by satisfying the earlier mallocs in different ways.


37.1.4 (Display page) Implementation issues Implementation issues

In many applications, malloc is called frequently. Thus the processes of choosing and releasing blocks should be fast. The amount of bookkeeping information needed for fast operation should be minimized. Finally, fragmentation—the splitting of free memory into pieces too small to be useful—should be minimized, both by having a good strategy for block selection, and by combining free blocks that are adjacent in memory. Clearly we need to keep track of the sizes of blocks. This can be done in several ways: maintaining a separate table of block sizes, storing size information with the blocks themselves, or segregating blocks of the same size in the same place. We will now consider the storage allocator in K&R, which takes the second approach (storing size information with the blocks themselves). 37.2 Explore the K&R storage allocator.(13 steps) 37.2.1 (Display page) Block organization Block organization

K&R's allocator splits memory into free blocks and allocated blocks. All blocks have their size stored immediately prior to the start of the block, as shown in the figure below.

Thus each block of a given size actually requires one extra unit of memory, in the block header. For example, a request for 8 units of memory really uses up 9, and so forth. The header information is placed before the memory in the block returned to the user, and is therefore not in an area the user should be accessing. Free blocks are stored in a circular linked list, sorted by memory address. The variable base points to a "dummy" block with size 0, which contains a pointer to the block with lowest memory address, which contains a pointer to the block with next lowest address, and so on. The last node links back to the base node. The address information linking the free blocks is stored in the header. (This information actually could have been stored in the block itself, since free blocks aren't being used by anyone yet.) The block size is expressed not in bytes but in terms of multiples of header size. The minimum size in bytes of an allocated block will always be a multiple of the header size. 37.2.2 (Display page) New C constructs New C constructs

Three new C constructs are used in the storage allocator. One is a union, which is like a struct except that only one of the fields will be relevant at any particular time. In the allocator, a union is used to represent the header:

union header {
    header layout information for a free block;
    header layout information for an allocated block;
};

Some other information external to the union determines which part of the union is relevant. A second new feature is alignment. As we learned earlier, some information must be aligned at an address that's divisible by 2 (short integers), 4 (ints), or 8 (doubles). Finally comes the typedef operator, which sets up synonyms for types. The syntax is

typedef existingType newTypeName;

A common example sets up a boolean type:

typedef int bool;

Another sets up a node pointer type:

struct node {
    int data;
    struct node *next;
};
typedef struct node *nodePtr;

In the storage allocator code, a typedef sets up an Align type that is used to ensure that each block starts at an address that's divisible by a suitable power of 2. 37.2.3 (Display page) A sample run Experiments

The K&R storage management code is contained in the file ~cs61cl/code/stgmgmt2.c, along with a main function that repeatedly requests allocation and free requests from the user. Format of user input is

a numberOfBytes
f addressOfAllocReturn

The argument to the f command is a hexadecimal address previously returned by an a command. (This information, as well as the free list and a list of all blocks, is printed in the program's response to the allocation request.) Here is a sample run that allocates two blocks, then frees them. Draw a diagram that represents the state of memory at each step, and compare it with the diagram of a labmate.

Command? a 5

address returned by malloc = 21798
Free blocks:
(free) addr = 21390, size = 0, next = 213a0
(free) addr = 213a0, size = 126, next = 21390
------
All blocks:
(free) addr = 213a0, size = 126, next = 21390
(alloc) addr = 21790, size = 2

Command? a 20
address returned by malloc = 21778
Free blocks:
(free) addr = 21390, size = 0, next = 213a0
(free) addr = 213a0, size = 122, next = 21390
------
All blocks:
(free) addr = 213a0, size = 122, next = 21390
(alloc) addr = 21770, size = 4
(alloc) addr = 21790, size = 2

Command? f 21798
Free blocks:
(free) addr = 213a0, size = 122, next = 21790
(free) addr = 21790, size = 2, next = 21390
(free) addr = 21390, size = 0, next = 213a0
------
All blocks:
(free) addr = 213a0, size = 122, next = 21790
(alloc) addr = 21770, size = 4
(free) addr = 21790, size = 2, next = 21390

Command? f 21778
Free blocks:
(free) addr = 213a0, size = 128, next = 21390
(free) addr = 21390, size = 0, next = 213a0
------
All blocks:
(free) addr = 213a0, size = 128, next = 21390

37.2.4 (Display page) Experiments Experiments

Run the program. Type some allocation and free requests. Verify or find the following, either by using gdb or by including printf statements in the code.

How many bytes of header information are there per block? How does the size displayed for a block relate to its size in bytes? Show that the size of an allocated block is stored at the address immediately prior to the block's address.

37.2.5 (Brainstorm) The smallest allocatable block? What's the size of the smallest allocatable block? Explain how you found it. 37.2.6 (Display page) Algorithms for allocating and freeing Algorithms for allocating and freeing

Here are the steps used in the K&R code for allocating and freeing blocks. Allocation

1. Search the list of free blocks (linked by the s.ptr field) for a block that's big enough.
2. If a block is exactly the right size, unlink and return it.
3. If a block is too big, split it in two and return the upper part of the block to the user.
4. If no block is big enough, ask the system for more memory, put the result in the free list, and search again.

Freeing

1. Search the free list (which is sorted by memory address) for the free blocks closest to the block being freed.
2. If none are adjacent, insert the newly freed block.
3. Otherwise, combine the block with the adjacent free block(s).

37.2.7 (Brainstorm) Free block count (1) Which of the following statements are true about a successful call to malloc, that is, a call that returns a non-NULL pointer? There may be 0 or more than one true statement in the list. Briefly explain your answer.

a. A successful call to malloc will always reduce the number of free blocks.
b. A successful call to malloc may leave the number of free blocks unchanged.
c. A successful call to malloc may increase the number of free blocks.

37.2.8 (Brainstorm) Free block count (2) Which of the following statements are true about a call to free? There may be more than one true statement in the list. Briefly explain your answer.

a. A call to free always increases the number of free blocks.
b. A call to free may reduce the number of free blocks by 1, but never by more than 1.
c. A call to free may leave the number of free blocks unchanged.
d. A call to free may reduce the number of free blocks by 2 or more.


37.2.9 (Brainstorm) Keeping the free list sorted As precisely as possible, indicate which code in the program maintains the list of free blocks in order of their memory addresses. Briefly explain your answer. 37.2.10 (Brainstorm) Maintaining circularity As precisely as possible, indicate which code in the program maintains the circularity of the list of free blocks. Briefly explain your answer. 37.2.11 (Brainstorm) Consequences of a user bug Suppose that a programmer using the K&R storage allocator accidentally overwrites the size of an allocated block—we’ll call it B—with a 0. What will be the effect of this accident?

a. The accident won’t cause any problem at all.
b. The accident will cause a crash when the overwrite occurs.
c. The accident will cause a crash when block B is freed.
d. The accident may cause a crash somewhere after the overwrite, but not necessarily when block B is freed.
e. The accident won’t cause a crash, but it will produce some storage that can no longer be used (a memory leak).

Briefly explain your answer. 37.2.12 (Display page) Exposing a flaw in the allocator Exposing a flaw in the allocator

First, find out the size of a block returned by sbrk (called from morecore). Then, after restarting the storage allocator, repeatedly request blocks whose size is just over half the size of a block returned by sbrk. Observe what happens, and comment in the next step. 37.2.13 (Brainstorm) The flaw Let n be the size of a block returned by sbrk. Suppose you repeatedly request a block of size a bit more than half n. What allocation pattern results, and why does it indicate a flaw in the allocator? 37.3 Compare tactics for choosing a block to return.(5 steps) 37.3.1 (Display page) First fit vs. next fit vs. best fit First fit vs. next fit vs. best fit

Suppose the free list contains more than one block that would satisfy a given allocation request. Which block should we choose to return? Three alternatives for making this choice are the following:

First fit: Search the free list linearly, stopping when we find a block that's big enough to satisfy the allocation request. Split off a smaller free block if the chosen block is too big, and return.
Next fit: A variation on first fit. Instead of always starting the search at the start of the free list, it keeps track of where the last search ended up, and continues from there. This approach avoids having a lot of small blocks at the start of the free list. It's what's used in the K&R code.
Best fit: Search the whole free list to find the block whose size is closest to the size requested. Split off a new free block if necessary, then return.

37.3.2 (Brainstorm) First fit disadvantage Why would first fit result in a lot of small blocks at the start of the free list? 37.3.3 (Brainstorm) Advantages for next fit and for best fit Give an advantage for next fit over best fit. Also give an advantage for best fit over next fit. 37.3.4 (Brainstorm) First fit win over best fit, and vice versa Give a sequence of allocations and frees that the first fit approach can satisfy, but for which the best fit approach gets stuck. Then give a sequence of allocations and frees that produces exactly the opposite result (first fit gets stuck, best fit doesn't). You may wish to work on this with a classmate (two heads are better than one). 37.3.5 (Display page) Summary Summary

As we just noted, next fit is preferable to first fit. Compared to best fit, it may produce somewhat more fragmentation, but its speed advantages usually make it the chosen approach. 37.4 Homework and readings(3 steps) 37.4.1 (Display page) Homework Homework

Homework assignment 9 and project 4 are now available. By the start of your lab next Tuesday or Wednesday, post a question about project 4 in the discussion step that follows. Then, by the start of your lab next Thursday or Friday, post an answer to one of your labmates' questions that hasn't already been answered. 37.4.2 (Discussion Forum) Questions about project 4 By the start of your next lab, post a question about project 4 that you have or that you expect one of your labmates to have. Then, by the start of your lab next Thursday or Friday, post an answer (here) to one of your labmates' questions that hasn't already been answered. 37.4.3 (Display page) Readings Readings

Next week, we'll be covering input and output, described in sections 6.1 through 6.7 and B.7 and B.8 in P&H. (In the third edition, this material is in sections 8.1-8.6, A.7, and A.8.)


38 Input/output 2009-4-28 ~ 2009-4-29 (3 activities) 38.2 Machine structures for peripheral devices(19 steps) 38.2.1 (Display page) Complexities of input and output Complexities of input and output

You type a character, it appears on the screen. You think, that's all there is to it. Well, like everything else in machine structures, a lot happens under the covers to make it all seem natural. What's needed for input and output? Input comes from and output goes to a device like a mouse (input), a printer (output), or a disk (input or output). Thus there needs to be a way to connect these devices to the computer. There must also be a way to control these devices, respond to them, and transfer data. Transferred data is essentially a sequence of bytes. The transfer takes place over a bus; examples are shown below. Several aspects of i/o make it more complicated than dealing with memory. First, it's unpredictable: a memory location is always "ready" to be loaded, while a character from an input device might not be. Second, it's relatively slow. Transfer speeds for typical devices are listed in the table below.

device                   | behavior | partner | data rate (KBytes/sec)
-------------------------+----------+---------+-----------------------
keyboard                 | input    | human   |           0.01
mouse                    | input    | human   |           0.02
laser printer            | output   | human   |         100.00
music output             | output   | human   |         200.00
magnetic disk            | storage  | machine |      10,000.00
wireless network         | I or O   | machine |      10,000.00
graphics display         | output   | human   |      30,000.00
wired local-area network | I or O   | machine |   1,000,000.00

Compare these data rates to that of a 1 GHz microprocessor, which can execute one billion load or store instructions per second, or 4,000,000 KBytes per second. 38.2.2 (Display page) Control of character i/o Control of character i/o

Older computers had special instructions for i/o. In contrast, the MIPS, like most modern machines, uses memory-mapped i/o. A portion of the address space is dedicated to communication paths to input or output devices; input is done via a load from an address in this space, and output is done with a store. The two organizations are displayed below (left: special i/o instructions; right: memory-mapped i/o).

I/O devices have their own internal registers. The processor communicates with them by reading and writing these memory-mapped registers. The trick is that the I/O device runs concurrently with the processor, so we need a status indicator to keep the two working together properly. A typical device has a data register and a control register. Each register is represented by a memory location as shown below. The control register contains an "am I ready?" bit that indicates whether it's OK to read or write. The device has write access to this bit; the CPU has read access. To do input, we load from the address that corresponds to the input device's data register after checking the control register (again via load) to see if the device is ready. Output is similar: we load from the address corresponding to the control register, make sure the device is ready, and store into the device's data register.


Programs normally run in user mode. However, accessing device registers directly requires that a program be running in kernel mode (sometimes referred to as "supervisor mode"). The main idea of kernel mode is to keep the operating system from getting trashed by a user. For now, we are going to pretend we are part of the kernel and just get the I/O procedures working. The MARS interpreter simulates a single i/o device, a memory-mapped terminal (keyboard + display). The receiver reads characters from the keyboard, while the transmitter writes characters to the display. Each has a one-word control and data register, accessed by addresses 0xFFFF0000 through 0xFFFF000C as shown below in P&H Figure B.8.1 (A.8.1 in the third edition). 38.2.3 (Display page) Trying it out Trying it out

MARS 3.6 has a tool for simulating memory-mapped i/o. Select Tools->Keyboard and Display MMIO Simulator. Click Connect to MIPS. In the main MARS window, select the 0xFFFF0000 (MMIO) portion of the data segment. This is the memory-mapped i/o area. Notice the four words that are the control registers described above. Click Reset in MMIO simulator. Which ready bit goes on? It needs to be on before you can successfully perform i/o. (This is a clunky part of the MARS I/O simulator.) Write a short program that prints an 'A' by loading it into a register and then storing it to the transmitter data word. 38.2.4 (Display page) Polling Polling

One way to do i/o is by polling. The processor reads from the control register in a loop, waiting for the corresponding device to set its "ready" bit. The processor then loads from (input) or stores into (output) the associated data register. This load or store unsets the ready bit temporarily until the i/o operation is completed. Code fragments to read and write a character using polling appear below.

character input into $v0:

        lui  $t0,0xffff          # 0xffff0000
waitloop:
        lw   $t1,0($t0)          # receiver control
        andi $t1,$t1,0x0001
        beq  $t1,$0,waitloop
        lw   $v0,4($t0)          # receiver data

output of character in $a0:

        lui  $t0,0xffff          # 0xffff0000
waitloop:
        lw   $t1,8($t0)          # transmitter control
        andi $t1,$t1,0x0001
        beq  $t1,$0,waitloop
        sw   $a0,12($t0)         # transmitter data

The program ~cs61cl/code/char.s gets a single character from the keyboard and sends it to the display. Single-step the code. See how it is waiting, polling the "rcv ready" (receiver ready) bit? Type a key into the keyboard window of the MMIO tool. See it show up in the data register and the ready bit go on. Now if you continue single-stepping you see it find the "tx ready" (transmitter ready) is on and it will display the character. The help button on the MMIO tool gives more of the specifics for MARS. 38.2.5 (Display page) An echo service An echo service

Modify char.s to create echo.s, which echoes keystrokes to the display. Adjust the run speed slider so you can watch the instructions go by, say 30 instructions per second. Also in the MMIO simulator you can adjust the transmitter delay. You may need to uncheck DAD. (Don't forget about reset.) Run your echo.s at max with a 100 instruction transmitter delay. See the lag when you type. Still the characters get there because the reading from the keyboard is synched with the writing to the display. In echo, comment out the three lines of putc that wait for the display to be ready before writing to it. Run this at max with the same transmit delay. Now what happens when you type at it quickly? 38.2.6 (Brainstorm) Result of quick typing Describe what happened when you type quickly after removing the three lines of putc, and suggest a reason for this behavior.

38.2.7 (Display page) Printing a string Printing a string

As a more complete example, take a look at ~cs61cl/code/pstring.s. Try it out. Modify it. For higher-speed I/O devices you would not want this character-by-character flow control. 38.2.8 (Display page) Processing i/o Processing i/o

When you type at a command shell it reads characters from standard input and echoes them to the display until a '\n' is encountered. Then it parses the input and executes the command. Combine your echo program and the pstring program so that it saves the input string in a character array while it echoes it to the display. At the end of a line it should echo the whole string again. 38.2.9 (Display page) The kernel The kernel

Typically user programs are not permitted to access memory-mapped I/O directly. Only the kernel has access to that portion of the address space, and it provides controlled access to the shared resources. Remember from the book that the kernel resides in the upper half of the address space. Copy over ~cs61cl/code/kecho.s and study what you find there. The getc and putc routines are in the kernel segment, while main is in the user text segment. Assemble it in MARS and look at the addresses for the instructions. Uncheck DAD in the MMIO tool to run this. Single-step and see how the program enters and leaves the kernel as it performs I/O. 38.2.10 (Brainstorm) jal vs. jalr Why must we use a jalr to jump into the kernel segment? 38.2.11 (Display page) Interrupt handling Interrupt handling

As you might guess, it's generally a waste of CPU resources to have a program constantly querying an i/o device to see if it's ready. It's like checking the clock every few minutes to see whether it's time to get out of bed. Another way to do input and output is to rely on interrupts. An interrupt is like an unplanned procedure call that's invoked only when the i/o device is ready. Setting this up requires enabling the device to interrupt. When an interrupt-enabled device becomes ready, execution is interrupted by the hardware, kernel mode is entered, and control is transferred to a special location in kernel space. The code at that address does whatever needs to happen to service the interrupt—either making it not ready or disabling further interrupts—then returns control to the formerly executing program (leaving kernel mode if appropriate). There should be no other side effects noticeable to the interrupted program. Control flow is illustrated in the diagram below. An i/o interrupt is thus asynchronous with respect to instruction execution; it can happen in between any pair of instructions. 38.2.12 (Display page) MIPS details MIPS details

MIPS interrupts are handled through coprocessor 0. (Recall that we discussed coprocessors earlier in the context of floating-point computation. The floating-point co-processor is coprocessor 1.) Coprocessor 0 provides several registers that a programmer may use to enable interrupts and access status information such as the address of the interrupted instruction. Enabling keyboard and display interrupts on the MIPS is a two-step process. First, we turn on some bits in the coprocessor 0 Status register. Then we turn on the "interrupt enable" bit in the control word. On a real MIPS, this will cause an interrupt right away if the device is ready; MARS requires that a character be sent to the transmitter to start everything out. When an interrupt happens, control is immediately transferred to location 0x80000180. (Some computers have different entry points for the various kinds of interrupts. MIPS provides just one entry point, but provides enough information to figure out what device caused the interrupt.) The first thing that happens in the interrupt-handling code is the saving of registers that the code will use. These include the $t registers and $at, any of which may be in use in the interrupted instruction. The choreography of doing this is sometimes complex, so by convention registers $k0 and $k1 are reserved for use by the operating system (which includes interrupt handlers). Handler code goes on to determine what caused the interrupt, and to act accordingly. As noted earlier, loading a character from the receiver (keyboard) data register or storing a character into the transmitter (display) data register makes the device not ready, removing the possibility of an immediate subsequent interrupt on that device. 38.2.13 (Display page) Experimenting with an interrupt handler Experimenting with an interrupt handler


Operating system kernels provide sophisticated interrupt handling. This is some of the trickiest code to write. You will learn a lot more about it in CS 162. Here, we'll show you some of the basics. Copy ~cs61cl/code/echoint.s and load it into MARS. The user mode is no longer doing the work here. It makes a kernel call to enable_rxint. In the MIPS, coprocessor 0 is the system coprocessor. It contains a status register that includes critical system info such as whether interrupts are enabled or disabled and, when one occurs, what is its cause. Here we disable interrupts before manipulating the I/O device. We then modify the "rcv control" register to set the bit that will tell it to signal interrupts to the processor. Set the instruction rate to something interactive, bring up the MMIO tool, get rid of DAD, and turn down the Delay length. Step through the code to see this call get performed to enable interrupts. If you let it run you will see that the user code just loops. Pause the simulation and set a breakpoint at interp. Notice that this is a very special routine. It resides at a particular hardware address—0x80000180. When an interrupt occurs, the processor stops what it is doing, saves the PC in the EPC register and sets the PC to this address. All the registers belong to the interrupted code, so we can't touch any of them. Here we save everything that we might need. Normally, we would handle whatever interrupts have occurred, access I/O devices and put the data into buffers. Here we have programmed the interrupt handler to do exactly one thing—echo the key to the output. Step through this and see how it works. Try hitting a few more keys. Get rid of the breakpoint and see that this interrupt-driven I/O really does work. 38.2.14 (Brainstorm) Enabling interrupts On the MIPS with a memory-mapped device like the keyboard and output controller, a device interrupts if four bits are on. Identify three of them. 38.2.15 (Display page) Buffers Buffers

Processing of both input and output involves the use of buffers of characters. Webster defines "buffer" as "a device used as a cushion against the shock of fluctuations in business or financial activity." Here, we're worrying about the "shock" of connecting a fast producer of characters to a slow consumer of characters. An output buffer is a queue of characters. A print function adds characters to the end of the queue; the output interrupt handler removes and prints the character at the head of the queue. The file ~cs61cl/code/handler.s is a simple interrupt handler that uses a one-character buffer. Choose Settings->Exception Handler to install this code as the "official" interrupt handler. Then copy the file ~cs61cl/code/testo.s, assemble it, set the delay length to 200 or so, reset the MMIO simulator, and run it. Observe that numerous characters were dropped. In multiple runs, find the highest setting for delay length that avoids dropping of any characters. 38.2.16 (Brainstorm) Avoiding dropped characters What value did you find for delay length that was as large as possible while still avoiding dropped characters? 38.2.17 (Display page) What the buffer did What the buffer did

What good is the buffer? Try one more experiment, using the setting for delay length that you found in the previous steps. Instead of having putchar write into the buffer, have it write directly to the transmitter. Assemble the program, reset MMIO, and run, and observe the dropped characters. What's going on in general is that the fast processor is producing more characters than the slow processor (the transmitter) can consume. While the transmitter is busy, characters provided by the fast processor are stored in a buffer. Eventually the slow processor catches up, and it can start taking characters from the buffer. 38.2.18 (Brainstorm) Sizes of input and output buffers Would you expect a buffer to store characters input from the keyboard to be larger, smaller, or about the same size as a buffer to store characters output from a program? Briefly explain. 38.2.19 (Display page) Summary Summary

Today's activities introduced you to input and output processing. Here are things we would like you to understand about this topic:

the two strategies for communicating with an i/o device, namely polling and interrupt handling; the steps to get a character printed or read; the structure of an interrupt handler and the constraints under which it operates; the role of a buffer in processing input or output.

38.3 Homework and readings(2 steps) 38.3.1 (Display page) Homework Homework

Reminder: By the start of next lab, you are to post a response to one of the questions about project 4 in the discussion step of April 23 and 24. 38.3.2 (Display page) Upcoming topics Upcoming topics

The next lab will deal with generalized exception handling, of which interrupt handling is a special case. Relevant text material appears in chapter 6 and Appendix B, sections 7 and 8 (chapter 8 and Appendix A in the third edition). The appendix is available online. 39 Disks and exceptions 2009-4-30 ~ 2009-5-01 (3 activities) 39.1 See how hard disks work.(10 steps) 39.1.1 (Display page) Mechanical operation of disks


Mechanical operation of disks

A magnetic disk is one of the most important computer peripherals—and one of the most complex. It is a kind of storage device that works by magnetizing ferrite material on the surface of a rotating platter. It's like a tape recorder (remember tape recorders?) that records bits instead of music or speech. It's nonvolatile storage; that is, it retains its state without external power. You are probably familiar with several kinds of magnetic disks:

"Floppy" disks, so called because they once used a flexible recording medium (again like tapes). These are removable disks (i.e. you insert them into the disk drive when you need to use them). They are relatively slow and not very densely recorded. They have largely been replaced by USB flash devices and CD-ROMs. Hard disk drives. These are faster and recorded more densely. They're usually not removable.

A hard drive is an amazing piece of aeronautical design; the disk head flying over the surface of the disk has been likened to flying a 747 a foot off the ground at Mach 2. The head travels so close to the surface that less than a wavelength of visible light separates the two. And yet, you can go to Staples and buy a terabyte disk for a couple hundred dollars. (About ten years ago, with Prof. Dave Patterson, we built a 4 TB store; it required seven full racks of PCs and disks, big donations from IBM and Microsoft, and lots of graduate student effort! Now you just walk down to the office supply store.) The rest of this segment of activities will focus on hard disks. They're used as long-term inexpensive storage for files and as backup for main memory (they're the large, inexpensive, slow level in the memory hierarchy). Two photos of hard disks appear below. From October 2003 National Geographic


More detail


The second photo above shows twelve platters. These usually have information recorded magnetically on both surfaces. Bits are recorded in tracks, which in turn are divided into sectors. There are typically 10,000-50,000 tracks per surface. A typical sector size is 512 bytes. The diagram below illustrates this information.

The tracks and sectors are numbered, and the operating system keeps track of which tracks and sectors contain information it has written. In an input or output operation, the actuator moves the disk head (at the end of the arm) to the desired track, waits for the desired sector to rotate under the head (the sector number is stored on the disk), then does the read or write. Originally, all tracks had the same number of bits; thus the bit density was lower the further one went from the center of the platter. In the early 1990's, disk drives changed to keep the spacing between bits constant. As a result, the bandwidth of an outer track is as much as twice the bandwidth of an inner track. Hard drives are sealed. Why? To keep the dust out. The head is very close to the surface of the disk (5-20 nanometers), since the closer the head is to the disk, the denser the recording. 99.999% of the head and arm weight is supported by the air cushion developed between the disk and the head. Dust particles are 500-1500 nanometers in size, so even a single particle can damage the head or the recording surface. A disk controller usually handles the detailed control of the disk and the transfer between the disk and the memory. Most disk controllers have built-in caches that store sectors as they pass under the head. 39.1.2 (Display page) Disk performance Disk performance

Speed of a disk read or write to a sector—the disk latency—depends on four things:

the seek time to find the appropriate track; the rotation time to find the right sector on the current track; the transfer time to do the read from or write to the specified sector; controller overhead time.

Seek time depends on the number of tracks the head must cross and the speed of the actuator. Rotation time depends on how fast the disk rotates (obviously) and on how far the desired sector is from the head. Transfer time depends on the data rate (bandwidth) of the disk and the size of the request; bandwidth is determined by the bit density and the rotation speed of the disk. Manufacturers quote maximum, minimum, and average seek times in their literature; advertised averages, computed by summing the times for all possible seeks and dividing by the number of possible seeks, run 3-14 ms. P&H note, however, that this figure is likely to be greater than the actual average because of locality of disk references. 39.1.3 (Self Test) Desired place to put data Consider a disk with one platter. Sections of the disk are shown below in blue. Which sections would allow quickest data transfer?

39.1.4 (Display page) Average transfer time Average data transfer time

Let's calculate the time to read one sector (512 bytes) on the Deskstar 7K500 disk described below.

Rotation speed = 7200 revolutions per minute, so one revolution takes 60000/7200 = 8.33ms (1/2 rotation = 4.17ms). Average seek time = 8.5ms. Transfer rate = 102 megabytes per second. Controller overhead time = 0.1 ms.

We add the average seek time (8.5ms), the average rotational delay (half the rotation time, or 4.17ms), the transfer time (0.5KB / 102KB/ms, roughly 0.005ms), and the controller overhead time (0.1ms) to get the average time to read a sector. The total is about 12.77ms. 39.1.5 (Brainstorm) Cycles per transfer? We just computed an average figure of 12.77ms to read a sector from the disk. How many clock cycles is that? State your assumptions about clock rate, etc. How much of this time would you expect the CPU to be involved in actually dealing with the disk, as opposed to just waiting for the disk to complete a step? What difference would polled vs. interrupt-driven I/O make? 39.1.6 (Brainstorm) More data transfer? We computed the average time to read a sector by adding the average seek time (8.5ms), the average rotational delay (half the rotation time, or 4.17ms), the transfer time (roughly 0.005ms), and the controller overhead time (0.1ms). The total is about 12.77ms. Determine the time to read ten sectors of data, under each of the following assumptions:

a. The sectors are consecutive on a track. b. The sectors are all on the same track but are not in any particular sequence on the track. c. The sectors are scattered all over the disk.

Briefly explain your answers. 39.1.7 (Brainstorm) Fragmentation? Several of the issues that arose when we covered memory storage management (i.e. the implementation of malloc and free) also arise in the context of disk storage management. In particular, there is the issue of disk fragmentation, that is, the chopping up of disk storage into chunks that are too small to satisfy storage requests on their own. As fragmentation increases on a disk, what will happen to average access time for newly allocated files? I.e., will the access time for these files increase, stay the same, or decrease? Briefly explain your answer. What about access time for old files that have been on disk unchanged for a long time? 39.1.8 (Brainstorm) Swap space fragmentation? Suppose we're running a paged virtual memory system. Do we have to worry about fragmentation in swap space? Why or why not? 39.1.9 (Display page) Trends Trends

The trend is toward increasing capacity via increasing areal density (bits per square inch), increasing rotation speed, and decreasing cost per byte. Trends up through the early 2000's are here; more general information about disks is here. A data point for comparison: Hitachi just released the CinemaStar 7K1000, with the following characteristics:

1 terabyte, 3.5" disk 7200 rpm (1/2 rotation = 4.17ms) five platters, ten surfaces; 200 gigabytes per platter 100 gigabits per square inch areal density internal transfer rate = 1070 megabits per second; external transfer rate = 300 megabytes per second (see here for an explanation of the difference) cost = $200, twenty cents per gigabyte

39.1.10 (Brainstorm) Limits? Areal density and transfer rate seem to be increasing by around 60-100% per year. Suggest one or more factors that might limit improvement in these areas in the future. 39.2 Get further acquainted with exception handling.(7 steps) 39.2.1 (Display page) Exceptions Exceptions

Interrupts are a special case of exceptions. The term "exception" refers to any unscheduled event that disrupts program execution; an interrupt is an exception whose cause is outside the CPU (e.g. from an i/o device). Interrupts can be serviced between instructions. Traps and faults are exceptions whose cause is internal to the CPU. Often they occur within the processing of an instruction, so we cannot just let the instruction finish and then handle the exception. A list of some MIPS exceptions appears below.

Interrupt Bus error (impossible address) Address error (alignment error or attempt to address kernel space while in user mode) Overflow (arising from add or subtract operation) System call (execution of the syscall instruction) Breakpoint (execution of the break instruction)

Reserved instruction (unrecognized instruction, or instruction that's being implemented in software rather than hardware) Coprocessor unusable (probably not attached) TLB miss TLB modified (a store to a page whose TLB entry does not permit writes)

Notice that the "reserved instruction" exception provides the system with a way to emulate hardware instructions in software, similar to how you are doing it in project 4. This was done with floating point operations in early MIPS implementations. The TLB exceptions are unique to the MIPS; essentially all other architectures handle a TLB miss in hardware. Handling it in software saves some transistors but requires some very careful operating-system coding. Here's what the MIPS CPU does when it takes an exception.

1. It puts the address of the instruction in the interrupted program to resume in coprocessor 0's Exception Program Counter (EPC) register. 2. It enters kernel mode and disables interrupts (from outside the CPU). 3. It puts a code specifying what kind of exception is being handled in coprocessor 0's Cause register. 4. Then it sets the PC to 0x80000180 and starts fetching instructions from there.
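Once these hardware steps complete, software dispatches on the Cause register. A hedged C sketch: on the R3000, the exception code occupies bits 2-6 of Cause, and the names below are illustrative, not from any course-provided handler.

```c
/* Extract the exception code (ExcCode, bits 2-6) from a Cause register value. */
unsigned exc_code(unsigned cause) {
    return (cause >> 2) & 0x1f;
}

/* Sketch of the software dispatch on ExcCode (a few R3000 code numbers). */
const char *exc_name(unsigned cause) {
    switch (exc_code(cause)) {
        case 0:  return "interrupt";
        case 8:  return "syscall";
        case 9:  return "breakpoint";
        case 12: return "overflow";
        default: return "other exception";
    }
}
```

A real handler would jump to per-exception code rather than return a name, but the decode step is the same.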

Then the software takes over. It saves registers, determines the reason for the exception from the Cause register, and uses that to jump to appropriate handler code. 39.2.2 (Brainstorm) Why are syscalls treated like exceptions? We have previously used the syscall instruction for input and output. Why would these operations have to be done in kernel mode rather than user mode? 39.2.3 (Brainstorm) Why disable interrupts? Why should the hardware disable interrupts as part of switching to the exception handler? 39.2.4 (Brainstorm) Why not use the user stack? The interrupt handling code you worked with last lab used the user stack to save registers. For a full-fledged exception handler, however, this is a bad idea. Why? 39.2.5 (Display page) Nested exceptions Nested exceptions

MIPS provides a little bit of hardware support for handling nested exceptions; it keeps a three-element "stack" of mode (user or kernel) and interrupt-enable settings in coprocessor 0's Status register. On an exception it shifts these bits over and brings in two new ones. As just suggested, part of kernel space must be set aside for a trusted stack to hold saved values. Other aspects of software support for nested exceptions are somewhat tricky. For example, the exception handler must save the Exception Program Counter's value before any additional exceptions can be permitted to occur. Also, the exception handler can't count on the values in the $k0 and $k1 registers after return from a nested exception. On the other hand, most exceptions can be avoided in an exception handler. For example, the handler shouldn't be making any addressing errors. 39.2.6 (Brainstorm) Most likely nested exception candidates? Here's a list of common MIPS exceptions.

TLB miss TLB modified (a store to a page whose TLB entry does not permit writes) Bus error (impossible address) Address error (alignment error or attempt to address kernel space while in user mode) Overflow (arising from add or subtract operation) System call (execution of the syscall instruction) Reserved instruction (unrecognized instruction, or instruction that's being implemented in software rather than hardware) Coprocessor unusable (probably not attached)

(Recall that the exception handler has disabled interrupts.) Which of the exceptions are most likely to occur while processing another exception? Also, remember that any page fault will be preceded by a TLB miss, since the TLB only holds valid page table entries. 39.2.7 (Display page) Returning from an exception Returning from an exception

Returning from an exception is straightforward. We restore saved register values and execute the eret instruction (Exception RETurn), which restores the interrupted program's mode (user or kernel) and sets the PC to the Exception Program Counter. Note that the return of control to the interrupted instruction and the restoring of the interrupted program's mode have to be done in a single instruction. Running even a single instruction of application code in kernel mode would be a big security hole, while trying to execute a kernel instruction in user mode would just cause a crash. 39.3 Readings(1 step) 39.3.1 (Display page) Upcoming topics Upcoming topics

Next week's lab activities will deal with threads. The tutorial POSIX Thread Programming will be good background. 40 Threads 2009-5-05 ~ 2009-5-06 (3 activities) 40.2 Work with pthreads.(13 steps)

40.2.1 (Display page) Hardware parallelism A new approach to microprocessor design

We've seen one way to speed up a CPU, namely pipelining. P&H describe another approach in section 4.10: duplicating internal components of the CPU so that it can launch multiple instructions in every pipeline stage. This technique is called multiple issue. If managed by the hardware, it depends on (lots of) additional circuitry to keep track of dependencies of one set of instructions on others, not to mention hazards that we encountered in earlier activities involving the MIPS pipeline. There are tradeoffs to making more and more complicated circuitry, specifically the power required to run the CPU. Figure 1.15 in P&H, copied below, illustrates how power use caught up with clock rate in recent Intel microprocessors. This forced a design change that moves the focus from optimizing a single program running on a single processor to optimizing the throughput that results from multiple processors—called "cores"—per chip. Figure 1.17 shows the number of cores, power, and clock rates of recent microprocessors.

P&H Figure 1.15

P&H Figure 1.17 The big challenge now is for programmers to take advantage of the hardware parallelism. 40.2.2 (Display page) Software parallelism: threads Software parallelism: threads

One might imagine process-level parallelism, where each process gets a core of its own. More relevant for these activities, however, is parallel processing of threads, "sub-parts" of a process. A process has its own stack, heap, and code and data space. Threads, on the other hand, exist within a process and share information and resources with one another. They contain only the barest minimum of resources that enable them to exist as executable code. The POSIX Thread Programming document summarizes thread properties in a UNIX environment:

A thread exists within a process and uses the process resources. Each thread has its own independent flow of control as long as its parent process exists and the OS supports it. A thread duplicates only the essential resources it needs to be independently schedulable. Threads may share process resources with other threads that act equally independently (and dependently). A thread dies if its parent process dies. Changes made by one thread to shared system resources (such as closing a file) will be seen by all other threads. Two pointers having the same value point to the same data. Reading and writing to the same memory locations is possible, and therefore requires explicit synchronization by the programmer.

When compared to the cost of creating and managing a process, a thread can be created with much less operating system overhead. Managing threads requires fewer system resources than managing processes. All threads within a process share the same address space. Inter-thread communication is more efficient and in many cases, easier to use than inter-process communication. We assume that some part of the operating system is managing the threads: interrupting an executing thread, choosing the thread that's next to execute, and starting or resuming that thread's execution. In general, we cannot predict the sequence in which threads are chosen for execution. The goal of the subsequent activities is to give you some idea of that and other concerns faced by a programmer who wants to write programs that take advantage of parallel resources. 40.2.3 (Display page) Using pthreads

Using pthreads

Numerous hardware vendors have provided their own implementations of threads, but there was no standard; it thus was extremely difficult for programmers to develop portable threaded applications. To address this problem, a standard API and library were designed for UNIX systems. In this implementation, pthreads are defined as a set of C types and function calls provided via a pthread.h include file and a thread library. This library is provided on a cluster of 26 compute nodes donated by Intel and Google. The front-end compute node is icluster.eecs.berkeley.edu. To access the pthread primitives, provide the line

#include <pthread.h> at the start of your C file. Also add the option

-lpthread

(that's minus-ell-pthread) at the end of your gcc command. 40.2.4 (Display page) A threaded hello world A threaded "hello world"

Start by copying the HelloWorld.c file in the directory ~cs61cl/code/threads. Your main immediate task is to create some threads. This is done with the pthread_create function:

int pthread_create (pthread_t *thread, const pthread_attr_t *attr, void* (*start_routine)(void*), void *arg) Some information about this function:

A value of type pthread_t is a struct that represents a thread. Passed a pointer to a pthread_t, pthread_create stores information in the corresponding struct. We won't use the attr parameter. NULL is an acceptable placeholder. The start_routine argument is the name of the function that the thread will start executing. The arg argument is a value of unspecified type (thus the void * typing) that will be passed to start_routine when the thread begins.

Thus a typical call looks like the following.

pthread_t another_thread;
pthread_create (&another_thread, NULL, startingFunction, (void *) 1234);

This call creates and runs a new independent thread of execution that passes the value 1234 as an argument to startingFunction and begins executing that function. One other item of interest: when a pthread calls pthread_yield, it relinquishes use of its processor and waits in the run queue until it is scheduled again for execution. Note: The implementation of pthreads we are using seems to yield control of execution to the newly spawned thread right after creating it. This is only a quirk of our implementation and is not part of the pthread standard. Remember that properly synchronized programs should make no assumptions about how threads are scheduled, so do not rely on this quirk in your program. Add a call to pthread_create to HelloWorld.c to create threads that call the sayHello function. A number of compiler warnings will appear, complaining about the use of void *; disregard them. 40.2.5 (Brainstorm) No goodbyes All the threads were supposed to print their good-bye messages ("Bye! ..."). Explain why they didn't.

When the original thread (the one that started with main) of a program exits, the entire process quits and thus all threads within that process are killed. In this case, the original thread creates child threads and relinquishes execution control to them because of the implementation quirk mentioned above. As soon as the original thread regains control, however, it immediately exits without waiting for any of its child threads to complete, thus killing all of them. To make the original thread wait for its child threads to complete, we must add another loop to HelloWorld.c with a call to pthread_join, which has the following prototype:

int pthread_join(pthread_t thread, void **value_ptr); When thread A (the caller) joins thread B (the argument), thread A will not continue out of the call to pthread_join until thread B has completed and exited. The first argument is the pthread that the caller thread will join. We will not use the second argument. Add a loop to HelloWorld.c so that the main thread joins each child thread before main finishes and returns. Your new program should result in all the child threads printing goodbye. 40.2.7 (Brainstorm) Actually only a partial solution ... What weird aspects of the output did you notice? 40.2.8 (Display page) Sharing memory Sharing memory

Things get more complicated when two threads are sharing and potentially updating memory. Recall that we can't predict when a thread will be suspended or resumed. Here's an example. Suppose thread A and thread B both want to add 1 to a variable x. Each executes the program segment

lw   $t0,x
addi $t0,$t0,1
sw   $t0,x

There are (6 choose 3) = 20 ways to interleave A's execution with B's. For example, suppose x starts out at 0; one such interleaving is the following.

A executes lw $t0,x; A's $t0 contains 0.
B executes lw $t0,x; B's $t0 contains 0.
B executes addi $t0,$t0,1; B's $t0 contains 1.
A executes addi $t0,$t0,1; A's $t0 contains 1.
A executes sw $t0,x; A stores 1 into x.
B executes sw $t0,x; B stores 1 into x.

40.2.9 (Brainstorm) The problem? One might argue that x should have value 2 in the example just presented, since both A and B want to increment it. Why didn't it work? 40.2.10 (Brainstorm) Odds of success? How many of the 20 possible interleavings of execution of segments of A with execution of segments of B correctly leave 2 in x? Briefly explain how you got your answer. 40.2.11 (Display page) A mutex A mutex

The problem just noted arises from the failure to maintain a critical section, a sequence of instructions that must not be interrupted by any thread that's using x. We handle this problem with a construct called a mutex (short for "mutual exclusion"), which acts like a lock around a critical section. The lock is engaged at the start of the section and released at the end. A pthread mutex has type pthread_mutex_t. It is initialized via a C macro named PTHREAD_MUTEX_INITIALIZER. Engaging a lock is done with the pthread_mutex_lock function, and releasing it is done with pthread_mutex_unlock; both take an initialized mutex as argument. Here's some typical lock code.

// as a global variable
pthread_mutex_t lock1 = PTHREAD_MUTEX_INITIALIZER;

// before a critical section corresponding to lock1
pthread_mutex_lock (&lock1);
// critical section
pthread_mutex_unlock (&lock1);
// after critical section

The critical section should be as small as possible; otherwise we're running threads one at a time and forgoing the benefits of parallel programming! 40.2.12 (Display page) Mutex use Mutex use

Edit the sayHello function so that the threads that call it will first try to acquire the counter_lock before incrementing the global counter variable. After doing this, each thread should number itself properly while saying goodbye. 40.2.13 (Display page) Ordering the pthreads Ordering the pthreads

Now we want to have our threads say goodbye in a specific order: we want those with an even id to print first, and those with an odd id to print after all the evens have printed. Since we are not guaranteed that the threads will start in any given order, we must have the odd threads wait until all the even threads have printed. (We won't care about the order they print hello.) We do this with the aid of condition variables (CVs). A condition variable is a construct that allows threads to sleep until a certain condition is met, where the condition is determined by some shared information (such as a counter variable). CVs are always associated with locks because the shared information that they depend on must be synchronized across threads. You can think of a CV as a "big pillow" in the sense that threads can fall asleep on a condition variable and be awakened from that condition variable. Here's a story. Tommy the Thread wanted to access shared information. He acquired the appropriate lock but then was disappointed to see that the shared information wasn't ready yet. Without anything else to do, he decided to sleep (or wait) on a nearby condition variable until another thread updated the shared information and woke him up. Tammy the Thread finally came along and updated the shared information while Tommy was asleep on the nearby CV. Tammy noticed Tommy asleep, so Tammy signaled him to wake up off the condition variable. Tammy then went on her way and left Tommy to play with his new, exciting shared information. The above story sounds like a pretty useful thing for threads to be able to do, but there's one thing we glossed over: we know that to access the shared information, a thread needs to hold the lock. Since Tammy needs to update the shared information while Tommy is asleep and waiting, we must have any thread that sleeps on a condition variable release the lock while it is asleep. 
But since Tommy had the lock when he fell asleep, he expects to still have the lock when he wakes up, and so the condition variable semantics guarantee that a thread sleeping on a condition variable will not fully wake up until (1) it receives a wake-up signal and (2) it can re-acquire the lock that it had when it fell asleep. For more of a description (with no silly story), you can check out POSIX Thread Programming. To achieve the evens-print-first functionality, you will need to use the condition variable routines in the pthread library:

int pthread_cond_wait (pthread_cond_t *condition, pthread_mutex_t *lock);
int pthread_cond_broadcast (pthread_cond_t *condition);

The pthread_cond_wait routine will put the caller thread to sleep on the condition variable condition and release the mutex lock, guaranteeing that when the subsequent line is executed after the caller has awakened, the caller will hold lock. The pthread_cond_broadcast routine wakes up all threads that are sleeping on condition variable condition. IMPORTANT: To call any of the above routines, the caller must first hold the lock that is associated with that condition variable.

Failure to do this can lead to several problems. Check these global variables in your HelloWorld.c file:

pthread_mutex_t print_order_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t evens_done_CV = PTHREAD_COND_INITIALIZER;
int number_evens_finished = 0;

Add the logic necessary to have all the evens print before the odds. You will need to edit the sayHello function to contain calls to pthread_mutex_lock, pthread_mutex_unlock, pthread_cond_wait, and pthread_cond_broadcast. Do not use pthread_cond_signal. 40.3 Activities for next lab(1 step) 40.3.1 (Display page) Project work, question handling, course survey Project work, question handling, course survey

There will be no assigned activities for lab on May 7 and 8 other than a course survey. Spend the time working on project 4 or clearing up questions arising from earlier lab activities. 41 The last lab 2009-5-07 ~ 2009-5-08 (1 activity) 41.1 Please provide us with some information.(4 steps) 41.1.1 (Display page) Survey and discussion Survey + discussion

There are three activities to do before the final exam. One is a (rather long) survey, which we hope will provide us with information about how lab-centric instruction compares to its traditional-format counterpart and how to make it better. The next two are discussion steps; we hope that your responses will reveal information about how to improve the use of "brainstorm" questions and peer evaluation, not just in CS 61CL but in other classes run in the lab-centric format. 41.1.3 (Discussion Forum) Discussion of "brainstorm" questions We'd like your opinions and suggestions about the use of "brainstorm" questions. Please also comment on your labmates' opinions. 41.1.4 (Discussion Forum) Discussion of peer evaluation We'd like your opinions and suggestions about the use of peer evaluation. Please also comment on your labmates' opinions.
