Introduction
● Why and what for 2253
● Levels of abstraction
● Instruction Set Architectures
● Major parts of any computer
● von Neumann architecture
● Flow of control

CS2253
● Goal: write a simple C program and understand how the computer actually executes it.
● This year, we study the ARM7TDMI processor.
● Last year, we used the fictional LC-3.
● The Church-Turing thesis essentially states that all systems with a certain minimal computational capability are able to compute the same things as each other. So LC-3 vs ARM vs ... does not matter, at least theoretically.
● And in actuality, LC-3, ARM, etc. are fundamentally similar.
● Easy to pick up a second machine....

Textbook Fig 1.9

Levels of Abstraction

● From atoms, we build transistors. [Lowest level]

● From transistors, we build gates (ECE courses).

● From gates, we build microarchitectures (CS3813) that implement machine instructions.

● From machine instructions, we can make simple programs directly (first part of this course).

● Or a compiler can string together machine instructions corresponding to our C code (second part).

● From simple pieces of C code, we can build OSes, databases, other complex systems. [Highest level]

Microarchitecture example (block diagram, book Figure 1.4)

Why Should I Care About the Lower Levels?

● If you want to work in the computer hardware field, it's obvious.

● If you want to work in software:
– it's sad if you aren't intellectually a little bit curious about how things really happen.
– when things fail, you tend to need to “see behind the veil” to understand what is going wrong. Debugging often requires a lower-level view.
– it helps you understand why some operations would be fast, while others would be slow. (Performance debugging.)

Instruction Set Architecture
● ISA is an important concept. In Java terms, it is like an Interface to the microarchitecture (hardware).
● It specifies all the things that you would need to know, to write a machine-instruction program:
– what are the basic instructions?
– how does the machine find the data for instructions?
– what are the basic data types supported (bit lengths)?
– how is memory organized?
– how and where are the instructions stored?

Multiple implementations of an ISA
● In Java, several classes can implement the same Interface.
● So, several microarchitectures can have the same ISA. They can run the same machine instructions as each other.
● IBM mainframes in the 1960s: fast implementation if you're rich, slow ones if not.
● Today: AMD and Intel processors can run the same code.
● But computer designers like to extend ISAs over time. Backward compatibility is the goal. No bad idea ever dropped. Ugly messes like Intel x64 architecture.

Textbook Fig 1.5

Major Parts of Any Computer

● An input and an output facility (from/to people or devices)

● Memory/Memories to store data and programs

● A “datapath” with the logic needed to add, multiply, store ... binary values, etc.

● A “controller” that goes through the program and makes the datapath do the required operations, one at a time.

● Controller + Datapath = Processor (aka CPU)

● Controller is the “puppeteer” and the datapath is the “puppet”.

Textbook Figure 1.2: Block diagram of a SOC (system on chip)

John von Neumann's architecture

● John von Neumann was a brilliant mathematician (and statistician and nuclear weapons guy and father of game theory and inventor of MergeSort and ...) who wrote an influential report in 1946 with a computer design.

● A von Neumann architecture stores the program in (a different part of) the same memory that stores the data.

● A Harvard architecture uses separate memories.

● Modern computers appear to be von Neumann, but behind the scenes are a bit Harvard-ish.

● Except for some special-purpose machines.

Textbook Figure 1.8 von Neumann and the IAS, 1952

How a von Neumann Architecture Works
● The program is a list of instructions located in memory.
● Part of the control unit is the Program Counter (PC), which points to the exact place for the current instruction.
● while (true) {
      Fetch current instruction from memory
      Increment PC
      Inspect current instruction and do what it says
  }
● This is the Fetch-Execute cycle.

Be a Human CPU
● “Hokey Pokey” program
  Memory Address
  1000 right hand in
  1001 right hand out
  1002 right hand in
  1003 shake it all about
  1004 turn yourself around
  1005 go back to address 1000
● PC starts at 1000
● Fetch and do “right hand in”; PC ← 1001
● Fetch and do “right hand out”; PC ← 1002
● ….
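The fetch-execute cycle can be tried out on the Hokey Pokey program. The following is only a sketch: the class name, the `run` helper and the idea of storing instructions as strings are assumptions made for illustration (a real CPU fetches binary machine codes).

```java
import java.util.ArrayList;
import java.util.List;

class HokeyPokeyCpu {
    static final int BASE = 1000; // address of the first instruction
    static final String JUMP = "go back to address ";
    static final String[] MEMORY = {
        "right hand in",           // 1000
        "right hand out",          // 1001
        "right hand in",           // 1002
        "shake it all about",      // 1003
        "turn yourself around",    // 1004
        "go back to address 1000"  // 1005
    };

    // Runs the given number of fetch-execute steps and returns the moves performed.
    static List<String> run(int steps) {
        List<String> performed = new ArrayList<>();
        int pc = BASE;
        for (int i = 0; i < steps; i++) {
            String instruction = MEMORY[pc - BASE]; // fetch
            pc++;                                   // increment PC
            if (instruction.startsWith(JUMP)) {     // inspect and execute
                pc = Integer.parseInt(instruction.substring(JUMP.length()));
            } else {
                performed.add(instruction);
            }
        }
        return performed;
    }

    public static void main(String[] args) {
        System.out.println(run(7));
    }
}
```

The jump is the only instruction that writes to the PC itself; every other instruction leaves the PC at the incremented value.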

Control Flow
● Normally, when you finish one instruction you advance to the (sequentially) next one.
● This is called straight line flow of control. PC just increments.
● But there are instructions like “go back to the instruction at address 1000” that disrupt this.
● Good thing, since that's how we can do IF and WHILE in a high-level language.
● Control flow is often disrupted conditionally, based on status flags that record what happened earlier (eg, did last addition give -ve result?)

More Realistic Instructions
● Instead of “right hand in”, a CPU instruction will be something simple like a request to take two integers stored locally in the CPU, add them, and store the result in the CPU
● Or “go to memory location 2000 and load the 8-bit integer there into the CPU”
● Or “go back to memory address 1000”, as in the Hokey Pokey program. A jump or branch instr.
● A compiler or a programmer needs a lot of simple instructions to do something more complicated.

Data
● (Some repeating CS1083, ECE course)
● bits and bit sequences
● integers (signed and unsigned)
● bit vectors
● strings and characters
● floating point numbers
● hexadecimal and octal notations

Bits and Bit Sequences
● Fundamentally, we have the binary digit, 0 or 1.
● More interesting forms of data can be encoded into a bit sequence.
● 00100 = “drop the secret package by the park entrance”
  00111 = “Keel Meester Bond”
● A given bit sequence has no meaning unless you know how it has been encoded.
● Common things to encode: integers, doubles, chars. And machine instructions.

Encoding things in bit sequences (From textbook)
● Floats
● Machine Instructions

How Many Bit Patterns?
● With k bits, you can have $2^k$ different patterns
● 00..00, 00..01, 00..10, … , 11..10, 11..11
● Remember this! It explains much...
● E.g., if you represent numbers with 8 bits, you can represent only 256 different numbers.

Names for Groups of Bits
● nibble or nybble: 4 bits
● octet: 8 bits, always. Seems pedantic.
● byte: 8 bits except with some legacy systems. In this course, byte == octet.
● after that, it gets fuzzy (platform dependent). For 32-bit ARM,
– halfword: 16 bits
– word: 32 bits

Unsigned Binary (review)
● We can encode non-negative integers in unsigned binary. (base 2)
● 10110 = $1 \cdot 2^4 + 0 \cdot 2^3 + 1 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0$ represents the mathematical concept of “twenty-two”. In decimal, this same concept is written as 22 = $2 \cdot 10^1 + 2 \cdot 10^0$.
● Converting binary to decimal is just a matter of adding up powers of 2, and writing the result in decimal.
● Going from decimal to binary is trickier.

Division Method (decimal → binary)
● Repeatedly divide by 2. Record remainders as you do this.
● Stop when you hit zero.
● Write down the remainders (left to right), starting with the most recent remainder.

Subtract-powers method
● Find the largest power of 2, say $2^p$, that is not larger than N (your number). The binary number has a 1 in the $2^p$'s position.
● Then similarly encode $N - 2^p$.
● Eg, 22 has a 1 in the 16's position
  22-16=6, which has a 1 in the 4's position
  6-4 = 2, which has a 1 in the 2's position
  2-2=0, so we can stop....
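Both decimal → binary methods are short enough to sketch in Java. The class and method names are made up for this illustration, and the code assumes a non-negative `int` input.

```java
class DecimalToBinary {
    // Division method: repeatedly divide by 2 and record the remainders.
    static String divisionMethod(int n) {
        if (n == 0) return "0";
        StringBuilder bits = new StringBuilder();
        while (n > 0) {
            bits.insert(0, n % 2); // most recent remainder ends up leftmost
            n /= 2;
        }
        return bits.toString();
    }

    // Subtract-powers method: peel off the largest power of 2 that still fits.
    static String subtractPowers(int n) {
        if (n == 0) return "0";
        StringBuilder bits = new StringBuilder();
        // highestOneBit gives the largest power of 2 that is not larger than n
        for (int p = Integer.highestOneBit(n); p > 0; p /= 2) {
            if (n >= p) {
                bits.append('1');
                n -= p;
            } else {
                bits.append('0');
            }
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        System.out.println(divisionMethod(22)); // 10110
        System.out.println(subtractPowers(22)); // 10110
    }
}
```

Java's built-in `Integer.toBinaryString` gives the same answer and is a handy cross-check.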

Adding in Unsigned Binary
● Just like grade school, except your addition table is really easy:
● No carry in: 0+0 = 0 (no carry out)
  0+1 = 1+0 = 1 (no carry out)
  1+1 = 0 (with carry out)
● Have carry in: 0+0 = 1 (no carry out)
  0+1 = 1+0 = 0 (with carry out)
  1+1 = 1 (with carry out)

Fixed-Width Binary Integers
● Inside the computer, we work with fixed-width values.
● Eg, an instruction might add together two 16-bit unsigned binary values and compute a 16-bit result.
● Hope your result doesn't exceed 65535 = $2^{16}-1$. Otherwise, you have overflow. Can be detected by a carry from the leftmost stage.
● If a result doesn't really need all 16 bits, a process called zero-extension just prepends the required number of zeros.
● 10111 becomes 0000000000010111.
● Mathematically, these bit strings both represent the same number.

Signed Numbers
● A signed number can be positive, negative or zero.
● An unsigned number can be positive or zero.
● Note: “signed number” does NOT necessarily mean “negative number”.
  In Java, ints are signed numbers. Can they be positive?? Can they be negative??

Some Ways to Encode Signed Numbers
● All assume fixed width; examples below for 4 bits
● Sign/magnitude: first bit = 1 iff number -ve
  Remaining bits are the magnitude, unsigned binary
  Ex: 1010 means -2
● Biased: Store X+bias in unsigned binary
  Ex: 0110 means -2 if the bias is 8. (8+(-2) = 6)
● Two's complement: Sign bit's weight is the negative of what it'd be for an unsigned number
  Ex: 1110 means -2: -8+4+2 = -2
● You can generally assume 2's complement...

Why 2's Complement?
● There is only one representation of 0. (Some other representations, such as sign/magnitude, have both -0 and +0.)
● To add 2's complement numbers, you use exactly the same steps as unsigned binary.
● There is still a “sign bit” - easy to spot negative numbers
● You get one more number (but it's -ve)
  Range of N bits 2's complement: $-2^{N-1}$ to $+2^{N-1}-1$

2's Complement Tricks
● +ve numbers are exactly as in unsigned binary
● Given a 2's complement number X (where X may be -ve, +ve or zero), compute -X using the twos complementation algorithm (“flip and increment”)
● Flip all bits (0s become 1s, 1s become zeros)
● Add 1, using the unsigned binary algorithm
● Ex: 00101 = +5 in 5 bit 2's complement
  11010 + 1 → 11011 is -5 in 2's complement
● And Flip(-5)=00100. 00100+1 back to +5
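Flip-and-increment can be sketched for an arbitrary small width. Java ints are always 32 bits, so this illustration (the class name, method name and `width` parameter are assumptions, with 1 <= width <= 31) masks the result back down to the low `width` bits.

```java
class FlipAndIncrement {
    // Negates x, viewed as a width-bit two's complement pattern (1 <= width <= 31).
    static int negate(int x, int width) {
        int mask = (1 << width) - 1;  // ones in the low 'width' bit positions
        int flipped = ~x & mask;      // flip all bits, keep only 'width' of them
        return (flipped + 1) & mask;  // add 1, discarding any carry out
    }

    public static void main(String[] args) {
        int minusFive = negate(0b00101, 5);
        System.out.println(Integer.toBinaryString(minusFive));            // 11011
        System.out.println(Integer.toBinaryString(negate(minusFive, 5))); // 101
    }
}
```

One oddity worth trying: in 5 bits, negating 10000 (-16) gives 10000 again, because +16 is outside the representable range.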

Converting a 2's complement number X to decimal
● Determine whether X is -ve (inspect sign bit)
● If so, use the flip-and-increment to compute -X
  Pretend you have unsigned binary.
  Slap a negative sign in front.
● If number is +ve, just treat it like unsigned binary.

Sign extension
● Recall zero-extension is to slap extra leading zeros onto a number.
● Eg: 5 bit 2's compl. to 7 bit: 10101 → 0010101
  Oops: -11 turns into +21. Zero extension didn't preserve numeric value.
● The sign-extension operation is to slap extra copies of the leading bit onto a number
● +ve numbers are just zero extended
● But for -11: 10101 → 1110101 (stays -11)
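The difference between zero extension and sign extension shows up when a 5-bit pattern is widened to a full 32-bit Java int. A rough sketch, with made-up names and assuming 1 <= width <= 31:

```java
class Extend {
    // Zero extension: keep the low 'width' bits, fill the rest with zeros.
    static int zeroExtend(int x, int width) {
        return x & ((1 << width) - 1);
    }

    // Sign extension: push the pattern's sign bit up to bit 31, then let
    // Java's arithmetic right shift (>>) copy it back down.
    static int signExtend(int x, int width) {
        int shift = 32 - width;
        return (x << shift) >> shift;
    }

    public static void main(String[] args) {
        System.out.println(zeroExtend(0b10101, 5)); // 21
        System.out.println(signExtend(0b10101, 5)); // -11
    }
}
```

Masking the sign-extended value to 7 bits (`& 0x7F`) reproduces the 1110101 pattern from the slide.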

Overflow for 2's complement
● Although the addition algorithm is the same as for fixed-width unsigned, the conditions under which overflow occurs are different.
● If A and B are both same sign (eg, both +ve), then if A+B is the opposite sign, something bad happened (overflow)
● Overflow always causes this. And if this does not happen, there is no overflow.
● Eg, 1001 + 1001 → 0010 but -7 + -7 isn't +2.
  Note that -14 cannot be encoded in 4-bit 2's complement.

Numbering Bits
● On paper, we often write bit positions above the actual data bits.
● 543210 ← normally in a smaller font than this
  001010 bits 3 and 1 are ones.
● Sometimes we like to write bits left to right, and other times, right to left (which is more number-ish). We usually start numbering at zero.
● Inside computer, how we choose to draw on paper is irrelevant.
● Computer architecture defines the word size (usually 32 or 64). Usually viewed as the largest fixed-width size that the computer can handle, at maximum speed, with most math operations.
● So bit positions would be numbered 0 to 31 for a 32-bit architecture.
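The sign rule for 2's complement overflow can be checked mechanically. This sketch uses hypothetical names and takes the two operands as `width`-bit patterns (1 <= width <= 31); it is one way to express the rule, not how any particular CPU computes its overflow flag.

```java
class SignedOverflow {
    // True iff adding the width-bit patterns a and b overflows in two's complement.
    static boolean overflows(int a, int b, int width) {
        int mask = (1 << width) - 1;
        int signBit = 1 << (width - 1);
        int sum = (a + b) & mask;                            // fixed-width result
        boolean sameSign = ((a ^ b) & signBit) == 0;         // inputs agree in sign
        boolean resultFlipped = ((a ^ sum) & signBit) != 0;  // result sign differs
        return sameSign && resultFlipped;
    }

    public static void main(String[] args) {
        System.out.println(overflows(0b1001, 0b1001, 4)); // true:  -7 + -7
        System.out.println(overflows(0b0011, 0b0010, 4)); // false:  3 + 2
    }
}
```

Note that operands of opposite sign can never overflow, which the `sameSign` test captures.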

More Arithmetic in 2's complement
● Subtract: To calculate A-B, you can use A + (-B)
  Most CPUs have a subtract operation to do this for you.
● Multiplication: easiest in unsigned. (Most CPUs have instr.)
● D.I.Y. unsigned multiplication is like Grade 3:
  But your times table is the Boolean AND !!
  The product of 2 N-bit numbers may need 2N bits
● For 2's complement, the 2 inputs' signs determine the product's sign. eg, -ve * -ve → +ve
● And you can multiply the positive versions of the two numbers. Finally, correct for sign.

Bit Vectors (aka Bitvectors)
● Sometimes we like to view a sequence of bits as an array (of Booleans)
● Eg hasOfficeHours[x] for 1 <= x <= 31 says whether I hold office hours on the xth of this month.
● And isTuesday[x] says whether the xth is a Tuesday.
● So what if you want to find a Tuesday when I hold office hours?

Bitwise Operations for Bit Vectors
● Bitwise AND of B1 and B2:
  Bit k of the result is 1 iff bit k of both B1 and B2 is 1.
● Java supports bitwise operations on longs, ints
  int b1 = 6, b2 = 12; // 0b110, 0b1100
  int result = b1 & b2; // = 4 or 0b100
● Bitwise NOT (~ in Java)
● Bitwise OR ( | in Java)
● Bitwise Exclusive Or ( ^ in Java). Also written “XOR” or “EOR”.
● Pretty well every ISA will support these operations directly.

Find First Set
● Some ISAs have a Find First Set instruction. (You've got a bitvector marking the Tuesdays when I have office hours – but now you want to find the first such day.)
● Integer.numberOfTrailingZeros() in Java achieves this.
● So use
  Integer.numberOfTrailingZeros(hasOfficeHours & isTuesday)
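Here is a small sketch of the office-hours search. It assumes (this convention is not stated on the slides) that bit k of each int stands for day k+1 of the month; the class name, method name and sample dates are invented.

```java
class OfficeHours {
    // Returns the first day of the month that is set in both bit vectors,
    // or -1 if there is none. Assumed convention: bit k stands for day k+1.
    static int firstCommonDay(int hasOfficeHours, int isTuesday) {
        int both = hasOfficeHours & isTuesday;  // bitwise AND keeps common days
        if (both == 0) {
            return -1;  // numberOfTrailingZeros(0) would report 32
        }
        return Integer.numberOfTrailingZeros(both) + 1;
    }

    public static void main(String[] args) {
        int hours = (1 << 1) | (1 << 9) | (1 << 14);  // days 2, 10, 15
        int tuesdays = (1 << 2) | (1 << 9) | (1 << 16) | (1 << 23) | (1 << 30); // days 3, 10, 17, 24, 31
        System.out.println(firstCommonDay(hours, tuesdays)); // 10
    }
}
```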

Bit Masking
● Think about painting and masking tape. You can put a piece of tape on an object, paint it, then peel off the tape. Area under the tape has been protected from painting.
● We can do the same when we want to “paint” a bit vector with zeros, except in certain positions.
● Eg, I decide to cancel my office hours except for the first 10 days of the month.
● Or we can protect positions against painting with ones.
● Details next...

Bit Masking with AND
● AND(x,0) = 0 for both Boolean values of x
● AND(x,1) = x for both Boolean values of x
● bitwise AND wants to paint bits 0, except where the mask protects (1 protects)
● hasOfficeHours & 0b1111111111 is a bitvector that keeps my office hours for the first 10 days (only). Later in month, all days are painted false.
● hasOfficeHours &= 0b1111111111 modifies hasOfficeHours. By analogy to the += operator you may already love.
● The value 0b1111111111 is being used as a mask.
● Quiz: what does hasOfficeHours & ~0b1111111111 do?

Bit Masking with OR
● OR(x,1) = 1 for both Boolean values x
● OR(x,0) = x for both Boolean values x
● bitwise OR wants to paint bits with 1s, except where the mask prevents it. A 0 prevents painting.
● hasOfficeHours | 0b1111111111 is a bitvector where I have made sure to hold office hours on each of the first 10 days (and left things alone for the rest of the month)
● hasOfficeHours |= 0b1111111111 makes it permanent.
● Quiz: what would hasOfficeHours |= 0b101 do?

Bit Masking with EOR (aka XOR)
● EOR(0,x) = x for both Boolean values x
● EOR(1,x) = NOT(x) for both Boolean values x
● bitwise EOR wants to flip bits in positions that are not protected with a 0 in the mask.
● hasOfficeHours ^= 0b111100 inverts my office hour situation for Jan 3-6.
● Bit masking with EOR is less common than OR and AND.
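The three masking styles can be put side by side. The helper names below are invented for illustration; the mask values are the ones used on the slides.

```java
class Masking {
    static final int FIRST_TEN_DAYS = 0b1111111111; // ones in bit positions 0..9

    // AND: paints zeros everywhere the mask has a 0.
    static int keepOnly(int bits, int mask) {
        return bits & mask;
    }

    // OR: paints ones everywhere the mask has a 1.
    static int forceOn(int bits, int mask) {
        return bits | mask;
    }

    // EOR/XOR: flips the bits wherever the mask has a 1.
    static int invert(int bits, int mask) {
        return bits ^ mask;
    }

    public static void main(String[] args) {
        int hasOfficeHours = 0b101010;
        System.out.println(Integer.toBinaryString(invert(hasOfficeHours, 0b111100))); // 10110
    }
}
```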

Example: Is a Number Odd?
● Fact: A number is odd iff its least significant (i.e., rightmost) bit is 1.
● Java:
  if ( (myNum & 0b1) == 0b1)
      System.out.println("Very odd number");
● Note: in increasing order of precedence: ||, &&, |, ^, &, ==
  Even if you don't have to, maybe parenthesizing is a good idea. It's hard to remember weird operators' precedence levels.

Example: Multiple of 8?
● A binary value is a multiple of 8 (= $2^3$) iff it ends with 000.
● Related to the fact that a decimal number is a multiple of 1000 (= $10^3$) iff it ends with 000.
● Java:
  if ( (myVal & 0b111) == 0)
      System.out.println("multiple of 8");
● Fact: a more general rule is that the rightmost k bits of X are (X mod $2^k$) (not certain about -ve numbers)

Bit Shifting
● A bunch of operations let the elements of a bitvector play “musical chairs”.
● logical left shift: every bit slides one position left. The old leftmost bit is lost. The new rightmost bit is 0. Java << operator repeats this to shift the value several positions.
● Eg, 0b11 << 4 is same as 0b110000.
● logical right shift: similar, Java >>> operator.

Dynamically Generating Masks
● Shifts are useful for dynamically generating masks for use with bitwise AND, OR, EOR.
● The Hamming weight of a bunch of bits is the number of bits that are 1. (After Richard Hamming, 1915-1998.) Many modern CPUs have a special “population count” instruction to compute Hamming weight. Except for speed, it is not needed:
  int hWeight = 0; // Hamming weight of int value x
  for (int bitPos = 0; bitPos < 32; ++bitPos) {
      int myMask = 0b1 << bitPos;   // mask with a single 1 at position bitPos
      if ((x & myMask) != 0)        // parentheses needed: != binds tighter than &
          ++hWeight;
  }

Poor Man's Multiplication
● What happens if you take a decimal number and slap 3 zeros on the right? It's same as multiplying by $10^3$.
● Similarly, X << 3 is same as X * 8. Even works if X is -ve. (Unless X*8 overflows or underflows)
● Poor man's X*10 is (X<<3)+(X<<1) since it equals (8*X + 2*X)
● Compilers routinely optimize multiplications by some constants like this, since multiplication is often a harder operation than shifting and adding. Called strength reduction.

Poor Man's Division
● So then, does shifting bits to the right then correspond to division by powers of 2?
● For unsigned, yes. (Throwing away remainders).
● For 2's complement +ve numbers, yes.
● -ve numbers: no. Regular right shift inserts zeros at the leftmost position (the sign bit).
● 11111000 → 01111100 means -8 → +124
● A modified form, arithmetic right shift inserts copies of the sign bit at the leftmost. Java operator >> vs >>>
● 11111000 → 11111100 means -8 → -4 as desired
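A quick way to see poor man's multiplication and the two right shifts in action. The names are hypothetical; also note that Java ints are 32 bits wide, so the logical shift of -8 gives a much larger positive number than the 8-bit +124 in the slide example.

```java
class ShiftArithmetic {
    // Poor man's multiply by 10: 8*x + 2*x.
    static int timesTen(int x) {
        return (x << 3) + (x << 1);
    }

    // Logical right shift: a zero enters at the sign bit.
    static int logicalHalf(int x) {
        return x >>> 1;
    }

    // Arithmetic right shift: a copy of the sign bit enters at the left.
    static int arithmeticHalf(int x) {
        return x >> 1;
    }

    public static void main(String[] args) {
        System.out.println(timesTen(-3));       // -30
        System.out.println(arithmeticHalf(-8)); // -4
        System.out.println(logicalHalf(-8));    // 2147483644, i.e. 0x7FFFFFFC
    }
}
```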

Division by Constant, via Multiplication and Right Shifting
● Low-end CPUs may not have an integer divide instruction but may have a multiply. Want to divide by a constant y that is not a power of 2.
● Mathematically, x/y = x * 1/y
● Multiply 1/y by p/p, for p being some power of 2. Say p = $2^k$. So x/y = ( x * (p/y)) / p.
● p/y is a constant that you can compute. Division by p is a right shift.
● Considerations: effect of truncations and whether the multiplication overflows.

Example: Divide x by 17
● We get to choose p. Say p = $2^8$.
● 256/17 ≈ 15.06 is close to 15.
● Compute (x*15) >> 8 to approximate x/17.
● Test run for x=175. (175*15)/256 = 10 (throwing away remainder). Good.
● Test run for x=170. (170*15)/256 = 9 (because we throw away a fractional part of about .96). Oops!
● Can be improved, but will never be perfect. Still, maybe an approximate answer is okay.
● Closer approximations by using bigger values of p.
● Using 32-bit integers, what is the biggest number we can divide by 17 this way, without getting overflow?
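The divide-by-17 approximation with p = 256 is easy to try out. This is just the slide's recipe wrapped in a method with a made-up name; it assumes x is non-negative and small enough that x*15 does not overflow.

```java
class DivideBy17 {
    // Approximates x / 17 as (x * 15) >> 8, using p = 256 and 256/17 rounded down to 15.
    static int approxDiv17(int x) {
        return (x * 15) >> 8;
    }

    public static void main(String[] args) {
        System.out.println(approxDiv17(175) + " vs " + (175 / 17)); // 10 vs 10
        System.out.println(approxDiv17(170) + " vs " + (170 / 17)); // 9 vs 10
    }
}
```

Because 15 is slightly less than 256/17, the approximation can only come out too small, never too large.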

General-Purpose Mult. and Div.
● What if you want to multiply and divide by a variable?
● Today, most CPUs come with instructions to do this, except maybe the kind in your digital toaster.
● But you can always implement * by the Grade-3 shift-and-add algorithm. Or repeated addition.
● Division: see how many times you can subtract y from x (in a loop). Or (harder), implement the algorithm you learned in Grade 3.

More Bit Shuffling
● Most CPUs support more exotic ways for bits to play musical chairs. No operator like >> in Java or C for this, though:
● Left rotation by 1 position:
– every bit but leftmost moves left 1. The leftmost bit circles around and becomes the new rightmost bit.
● Left rotation by >1 positions is same result as doing multiple left rotations by 1.
● A right rotation by 1, or by >1 positions, also exists.
● Example: 1010011 right rotated by 1 is 1101001
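Rotation has no Java operator, but it can be built from shifts and OR. The sketch below rotates a pattern of a chosen width (the method name and `width` parameter are assumptions, 1 <= width <= 31). For full 32-bit values, Java's library already offers `Integer.rotateLeft` and `Integer.rotateRight`.

```java
class Rotation {
    // Rotates the low 'width' bits of x right by one position (1 <= width <= 31).
    static int rotateRight1(int x, int width) {
        int mask = (1 << width) - 1;
        x &= mask;                              // ignore anything above 'width' bits
        int oldRightmost = x & 1;               // the bit that circles around
        return (x >>> 1) | (oldRightmost << (width - 1));
    }

    public static void main(String[] args) {
        System.out.println(Integer.toBinaryString(rotateRight1(0b1010011, 7))); // 1101001
    }
}
```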

Hacker's Delight
● Henry Warren's book, Hacker's Delight, belongs on the bookshelves of serious low-level programmers.
● It is a collection of neat bit tricks collected over the years. It is the source of much of the implementation of Java's class Integer.
● Course website has a link to a web page with a similar collection of “bit hacks”.
● Despite the title, this book is not about breaking security...it's the older, honourable use of “hacker”.

Confessions
● A simple arithmetic right shift of a negative number is not quite the same as division by a power of 2. (Sometimes you can be off by one; two extra instructions can adjust for this.)
● A detailed analysis of the divide-by-a-constant approach (eg the divide-by-17 example) can avoid the small errors. People have worked out approaches for dividing exactly by 3, 5, 7 using multiplications and shifts….
● Chapter 10 of Hacker’s Delight is “Integer Division by Constants” and is 72 pages that are quite mathematical. Also the word “magic” appears many times.

Character Data
● Characters are encoded into binary. One historical method that is still used is a 7-bit code, American Standard Code for Information Interchange.
● ASCII contains upper and lower-case letters, punctuation marks and digits that would commonly have been needed for US English data processing needs.
● ASCII also encodes other things that control the assumed teletypewriter machine: these control codes include carriage return, line feed, tab, ring the bell, end of file, …

Control Codes
● Control codes are often invisible when printed, and some text editors won't show them. But software (e.g. compilers) can be thrown off by them. Leads to puzzled students sometimes.
● A common convention is to discuss control codes by using a letter preceded by ^. Eg, ^C. On many keyboards, pressing the Ctrl key at the same time as the letter can generate a control code.

Backslash Escapes
● Many programming languages have backslash escapes to represent some popular control codes.
● Eg, '\t', '\n', '\r' in Java and C. You type \t as two characters, but it represents a single character (a tab, ASCII code 9, ^I)
● '\123' represents a single byte whose ASCII code is 123 in octal (base 8 – more on this later) (In many programming languages)

Unicode
● Many (non-US) people found ASCII too limiting, so attempts were first made to extend it to other Western European character sets.
● Unicode seeks to represent all current and historical symbols in all cultures and languages. Original idea was that 16 bits was enough. Java uses this early Unicode idea, so char in Java is 16 bits.
● Unicode version 9.0 (2016) has >100k characters plus many more symbols. 16 bits is not enough.
● Each Unicode character is represented by a numeric code point. First 128 of them correspond to ASCII, for backwards compatibility. First $2^{16}$ are the Basic Multilingual Plane.
● There are several ways of encoding code points into bytes.

UTF-8, UTF-16 etc
● UTF-32 uses 32 bits to store a code point. It is a fixed-width encoding: if I know how many characters I need to store, I know precisely how many bytes it will cost me.
● But UTF-32 wastes bytes. Characters outside the Basic Multilingual Plane are rare. Codepoints from 0-127 (“ASCII”) are very common.
● UTF-16 represents codes in the BMP with 2 bytes. Weird codes outside need 2 more. Not a fixed-width encoding.
● UTF-8 represents ASCII codes in 1 byte, other BMP codepoints with 2 or 3 bytes, and weird codes with 4.
● UTF-8 is fully backward compatible with old ASCII files.
● In Java, the constructor for OutputStreamWriter (which can wrap a FileOutputStream) has an optional parameter that can be “UTF-8” or “UTF-16” etc. Otherwise, it uses the default charset.

Strings
● A string is a sequence of characters. In Java it's represented by a String object, as you know.
● In lower-level programming (C, assembler), a string is more likely viewed as a sequence of consecutive memory locations that store the successive characters in the string.
● Q: how do you know when the next memory location doesn't store the next character in a string (how to know a string is over)?
● Common convention: null-terminated string. A string always ends with a character whose ASCII code is 0. “C-style string”
● ARM assembly language: if you want a C-style string, you have to put the null at the end. Fun bugs if you forget.

Representing Fractional Values
● You can represent fractional values using a fixed-point convention. In decimal and for money (unit dollars), an example would be agreeing to store each value as a whole number of cents.
● So 2.35 is stored as 235. We have shifted the decimal point by a fixed amount, two positions.
● In unsigned binary 101.011 means $101_2$ and a fraction of $0 \cdot 2^{-1} + 1 \cdot 2^{-2} + 1 \cdot 2^{-3}$. I.e., $5.375_{10}$.

Fractional Values, 2.
● We can store all numbers by shifting the binary point right 3 (for example). So we are measuring everything in eighths. 5.375 is then stored as 101011, instead of 101.011.
● Can add and subtract fixed-point numbers successfully, as long as each is, for instance, measured in eighths.
● But multiplying two numbers given in eighths results in a product that is measured in 64ths. So have to divide by 8 (just shift right 3 positions...)
● Advantage: fractions handled using only integer arithmetic.
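A rough sketch of fixed-point arithmetic in eighths, using plain ints. The class and method names are invented; values are assumed non-negative and small enough not to overflow.

```java
class FixedPointEighths {
    static final int FRACTION_BITS = 3; // binary point shifted right 3: units of 1/8

    // Adding two counts of eighths gives a count of eighths.
    static int add(int a, int b) {
        return a + b;
    }

    // The raw product counts 64ths, so shift right 3 to get back to eighths.
    static int multiply(int a, int b) {
        return (a * b) >> FRACTION_BITS;
    }

    static double toDouble(int fixed) {
        return fixed / 8.0;
    }

    public static void main(String[] args) {
        int a = 0b101011; // 5.375 stored as 43 eighths
        int b = 0b010100; // 2.5 stored as 20 eighths
        System.out.println(toDouble(add(a, b)));      // 7.875
        System.out.println(toDouble(multiply(a, b))); // 13.375 (exact answer is 13.4375)
    }
}
```

The multiplication result is slightly low because the final shift throws away the 64ths that do not make up a whole eighth.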

Floats, Doubles etc.
● Scientific processes generate huge and tiny numbers. No single fixed-point shift will suit everything.
● Measured values have limited precision - no sense to store the number of meters to Alpha Centauri as an integer.
● Floating-point representation is a computer version of the “scientific notation” you learned in school, eg: $3.456 \times 10^{-5}$
● +3.456 is the significand or mantissa. We've 4 sig. digits
● Number is normalized to 1 significant digit before decimal point.
● The exponent is -5, and the sign is positive.

IEEE-754 Standard
● IEEE-754 is the standard way to represent a binary floating point value in 16 (half-precision), 32 (single precision), 64 (double precision), 128 (quad precision) or 256 (octuple precision) bits.
● 32-bit form available in C & Java as float; 64-bit form as double.
● Overall, it's a sign-magnitude scheme.
● But exponent is signed quantity using the biased approach.

IEEE-754 Floats
● 1 sign bit, S. (Bit position 31)
● next, 8 exponent bits with binary value E.
● Exponent bias b of 127.
● 23 fraction bits fff..fff to represent the significand of 1.ffff..fff. Note “hidden” or “implicit” leading 1.
● Formula for “normal” floats:
  Value = $(-1)^S \times 1.\mathrm{ffff...fff}_2 \times 2^{E-b}$

Example
● Find the numerical value of a float with bits
  0 00111100 00110000000000000000000
● Use formula $(-1)^S \times 1.\mathrm{ffff...fff}_2 \times 2^{E-b}$
● S=0. E = $00111100_2 = 60_{10}$. b=127 (always)
  ffff...fff = 00110000000000000000000
● So: $(-1)^0 \times 1.0011_2 \times 2^{-67} = +1 \times (1 + 3/16) \times 2^{-67}$
● It's a small positive number. Calculator for details.
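Instead of a calculator, the formula for normal floats can be applied in a few lines of Java and compared against `Float.intBitsToFloat`. The `decode` helper is a made-up name, and it only handles normal floats (exponent field strictly between 0 and 255).

```java
class FloatDecode {
    // Applies (-1)^S * 1.fff...f * 2^(E-127) to a 32-bit pattern (normal floats only).
    static double decode(int bits) {
        int s = bits >>> 31;           // sign bit, position 31
        int e = (bits >>> 23) & 0xFF;  // 8 exponent bits
        int f = bits & 0x7FFFFF;       // 23 fraction bits
        double significand = 1.0 + f / (double) (1 << 23); // hidden leading 1
        double sign = (s == 0) ? 1.0 : -1.0;
        return sign * significand * Math.pow(2, e - 127);
    }

    public static void main(String[] args) {
        int bits = 0b0_00111100_00110000000000000000000; // the slide's example
        System.out.println(decode(bits));
        System.out.println(Float.intBitsToFloat(bits));
    }
}
```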

Example 2: Determine the bits
● Determine how to represent -2253.2017.
● Helpful facts: 2253 = 2048 + 128 + 64 + 8 + 4 + 1.
● $0.2017 \times 2^{24}$ = 3383964 + fraction
● $3383964_{10} = 1100111010001010011100_2$
● Now let's put the pieces together.

Representable Values
● There is an uncountable infinity of real numbers.
● There are at most $2^{32}$ different bit patterns for a float. No bit pattern represents more than one real number.
● Therefore, there are real numbers that cannot be represented. (Overwhelming majority.)
● For any given exponent value, there are only $2^{23}$ different mantissas, 1.000... to 1.111...
● No number whose exponent exceeds 255-127
● No number whose exponent < (0 – 127).
● (Though in CS3813 you'll learn about subnormal numbers, so I've lied; IEEE-754 is a fair bit hairier than I've presented.)
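To check a hand encoding of -2253.2017, Java can show the three fields of the float it actually stores. This sketch only pulls the fields apart with shifts and masks; the class and method names are invented.

```java
class FloatFields {
    static int sign(float v) {
        return Float.floatToIntBits(v) >>> 31;
    }

    // The stored (biased) exponent field E.
    static int exponentField(float v) {
        return (Float.floatToIntBits(v) >>> 23) & 0xFF;
    }

    // The 23 fraction bits, without the hidden leading 1.
    static int fraction(float v) {
        return Float.floatToIntBits(v) & 0x7FFFFF;
    }

    public static void main(String[] args) {
        float v = -2253.2017f;
        System.out.println(sign(v));                // 1
        System.out.println(exponentField(v) - 127); // 11, since 2048 = 2^11
        System.out.println(Integer.toBinaryString(fraction(v)));
    }
}
```

The top 11 fraction bits hold 2253 - 2048 = 205, the part of the integer below the hidden leading 1; the remaining 12 bits come from the .2017 part.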

Example
● What is the next representable value, after 5/16?
● 5/16 = $0.0101_2 \times 2^0 = 1.0100...0_2 \times 2^{-2}$
● Now let's reason.

IEEE-754 Doubles
● 64 bits, divided up into
– 1 sign bit
– 11 exponent bits, bias 1023
– 52 fraction bits (53 significant bits, counting the hidden leading 1)
● More exponent bits: better range of numbers
● More fraction bits: smaller gaps between representable numbers (higher precision).
● Otherwise, like Float.

Machine Instructions
● Another thing that becomes binary: machine instructions.
● Typical m/c instruction has
– an operation code (opcode) that indicates which of the supported operations is desired
– codes indicating addressing modes that provide the input data (“operands”)
– code indicating where the result should be put
– code indicating the conditions under which the instruction should be ignored
● An instruction-format specification helps you determine how to assemble these codes into a machine-code instruction.
● Chapter 1: To store the constant 8 into a register variable:
  0b11100011101000001101000000001000 in ARM m/c code.
● We'll study ARM instruction formats later.

Hexadecimal
● A decimal number has about 1/3 as many digits as the corresponding binary number: small base, lots of digits.
● Humans do poorly with many-digit numbers.
● So for humans, it is handy to work in larger bases. But it's hard to convert base 10 numbers into base 2 numbers.
● Base 16, or hexadecimal (hex), is the go-to base for machine-level human programmers.
– numbers have few digits
– it's easy to convert to/from binary

Hexadecimal digits
● Whereas decimal uses digits 0 to 9, hexadecimal uses 0-9,A,B,C,D,E,F.
– Digit 7 has value seven, just like decimal
– Digit A has value ten, B has value eleven, … F has value fifteen
● Numbers have a ones place, a sixteens place, a $16^2$s place, a $16^3$s place, etc.
● 2F32 means $2 \cdot 16^3 + 15 \cdot 16^2 + 3 \cdot 16^1 + 2$
● In many languages, you prefix hex constants with 0x
● so int fred = 0x2f32; // works fine in Java.
  int george = 0x100; // equivalent to george = 256;

Converting Hex to Binary
● Because 16 = $2^4$, each hex digit expands to 4 binary digits.
● For 0x9A4
– the 9 expands to 1001
– the A expands to 1010
– the 4 expands to 0100
● So 0x9A4 expands to 0b100110100100

Converting Binary to Hex
● Binary → Hex is the reverse process.
● Only trick: you want the binary number to be the correct length (a multiple of 4 in length)
● So zero extend it, if necessary
● Then each group of 4 bits collapses to a hex digit.
● 101010 → 0010 1010 → 2A
● Rather than count bits and zero extend first, just circle groups of 4 bits starting from the right. If the last group has fewer than 4 bits, it's okay.

Small Negative Numbers
● We usually use unsigned hex to reflect bit patterns, even if they are meant to be 2's complement numbers.
● So what does a 32-bit negative number look like, if it is pretty close to 0?
● The corresponding bit pattern has a lot of leading ones. When converted to hex, each group of 4 ones turns into an F digit.
● So your not-very-negative number has lots of leading F's.
● 0xFFFF FFFB is the bit pattern for $-5_{10}$ = 0b111...111011
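Java can display the 32-bit pattern behind an int in hex, which makes the leading F's of small negative numbers easy to see. The `hex32` helper is a made-up name; it simply wraps `String.format`.

```java
class HexPatterns {
    // Shows the 32-bit pattern of v as 8 hex digits.
    static String hex32(int v) {
        return String.format("0x%08X", v);
    }

    public static void main(String[] args) {
        System.out.println(hex32(-5));       // 0xFFFFFFFB
        System.out.println(hex32(0b101010)); // 0x0000002A
        System.out.println(Integer.toBinaryString(0x9A4)); // 100110100100
    }
}
```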

Hex Arithmetic
● It's sometimes handy to do addition and subtraction of hex numbers without converting to decimal.
● (typically, subtraction when you want to figure out the size of something in memory, and you've got the starting and ending positions)
● Like Grade 3, except your addition/subtraction table is bigger.
– don't memorize: just use the values of digits
– you carry and borrow 16, not 10

Example: Hex Addition
● A debugger reports that an item begins in memory at address 0x1234. You know its size is 0x7D. What is the first address after the item?
●   1234
  +   7D
● 4 is worth 4, D is worth 13. Sum to 17, or 0x11
● Keep the 1, carry the 16 to the next stage
● 3 and 7 sum to 10. But there is a carry, bumping you to 11, or 0xB. No carry to next stage.
● So 1234 + 7D = 12B1.

Example: Hex Subtraction
●   1203
  - 0F15
● Since 3 < 5, borrow from 0x20 (making it 0x1F).
● You borrowed 16, so 3 is now worth 16+3=19.
● Take away 5, get 14. Hex digit for 14 is E.
● F-1 is E (no borrow needed).
● 1-F needs a borrow, makes 1 worth 16+1=17.
● Take away F (value 15), get 2. (hex digit is 2).
● You borrowed from the 1, so it's 0. 0-0=0.
● You could write down this leading zero, if you wanted....
● 1203-0F15 = 02EE

Octal (base 8)
● In bygone days, octal (base 8) was an alternative to hexadecimal.
● Conversion to/from binary is by grouping bits into groups of size 3, but otherwise same as hex.
● Octal survives in some niches. In a string or a character, a backslash can be followed by 3 octal digits (typically the ASCII code of some otherwise unprintable character).
● In Java and C, any digit string that starts with a leading zero is assumed to be octal. Remaining digits must be 0 to 7.
● So: int fred = 09; // mysterious compile error

ARM v4T
CS2253 Owen Kaser, UNBSJ

● History of ARM processors
● R is for RISC
● Registers
● Status flags and conditional execution
● Memory
● Example program

History of ARM v4T

● Acorn Computers in the UK, early 1980s
● Designed own CPU for a line of PCs, based on cutting-edge design trends then.
● Cutting edge was RISC: Reduced Instruction Set Computers.
● ARM was the Acorn RISC Machine
● Circa 1990, retitled Advanced RISC Machine and the design was licensed to other companies to manufacture or add extra components, as part of a System-on-a-Chip.
● Like the extra stuff to make an Apple Newton, an iPod, a Nokia phone...

History of ARM v4T, cont.

● The ISA has been added to over the years. ARM v4T dates from early 1990s.
● Actually, v4T has the regular 32-bit ARM ISA and a simpler Thumb ISA, where instructions can be 16 bits long. We ignore Thumb in CS2253.
● New versions of the ISA have come out in the meantime (though old are still being produced).
● ISAs that evolve tend to get ugly, preserving backwards compatibility. There is now a 64-bit ISA that apparently is once again clean. Maybe we can shift 2253 to it in future.

ARM is Popular

● ARM variations are the champion in popularity for mobile devices.
● By 2002, there were 1.3 billion manufactured
● In just 2012, 8.7 billion were manufactured.

What's RISC?

● The R in ARM stands for Reduced Instruction Set Computer.
– in contrast to the extremely complicated CPUs of the late 1970s (VAX had an “evaluate polynomial” instruction, for instance). A “CISC” machine has some advantages, in “code density”.
– complex means expensive to make, and hard to make run fast.
– RISC tried to simplify ISAs, so implementation can be simple and fast.

RISC Principles

● There should be a small number of instructions.
● Every instruction should do something very simple, so it can run in 1 clock cycle.
● All machine codes should be the same length (32 bits).
● There should be relatively few different machine code formats.
● Should be a fair number of storage registers, and most operations should involve only them.
● Values should be transferred between RAM and registers by explicit Load and Store instructions.

ARM v4T Components

● There are 15 main registers, R0 to R14. Each can store any 32-bit value. R13 and R14 are a tad special.
● As a first approximation, a HLL programmer can view them as the only real “variables” you have.
● R15 is also called PC (Program Counter) and keeps track of where to fetch instructions.
● Due to “pipelining”, when an instruction executes, PC actually stores the address of the instruction that is 8 bytes ahead. Pipelining is an advanced CS3813 topic.

Example Instructions

● Add two register values, result in 3rd register.
● Exclusive-OR two register values, result in 3rd.
● Change the program counter (subtract 16 from it)
● Get a halfword from memory, at an address that is 10 more than the current value of R1. Sign extend it and put it in R2. Modify R1 to be increased by 10.
● Store the low-order byte of R1 into memory, address obtained by taking R2 and shifting it left 2, then adding that value to R3.
In each case, the technical ARM documentation can tell you how the instruction would be encoded into bits.

ARM Components: CPSR

● The Current Program Status Register is a collection of 12 miscellaneous bits.
– 4 keep track of how recent instructions went (“status flags”)
– 8 allow you to see and control the processor configuration (“control bits”). We don't need them initially.
● Chapter 2 of the textbook tells you about other advanced concepts that aren't needed until the hardest parts of the course, much later.
● Please ignore anything about “processor modes” other than User, for now.

Status Flags

● Most ISAs (except the MIPS ISAs we often study in CS3813) use status flags.
● They help record the outcome of an earlier instruction, so that your program can do different things, depending on what happened earlier.
● Flags are N (bit 31 of the result was 1), Z (all bits of the result were 0), V (result oVerflowed), and C (there was a Carry out)
● Many instructions have a version that updates the flags and another that doesn't. But some instructions always update the flags.

Conditional Execution

● Most ARM instructions can be made conditional, so they do nothing unless the status flags satisfy the specified condition.
● Example: 64-bit counter.
– First instruction sets flags while incrementing the low-order 32 bits
– Second instruction runs conditionally and only increments the high-order 32 bits if the Z flag is set
– Maybe low-order bits in R1 and high-order in R2
● Non-ARM ISAs generally have only a few conditional instructions (the ones that implement IF)

Constants

● Many ARM operations can use constants (just like you can add two registers together, you can add a register to a constant, etc.)
● ARM constants are weird. Numbers -128 to 255 are okay, as are a few larger numbers
● Allowable larger numbers are those obtained by rotations of 0-255 by an even number of positions, etc. More later.

Memory

● The ARM processor is byte addressed, in that every byte of memory has its own address, starting from address 0.
● Addresses are 32 bits long, leading to a maximum of 4GB of memory (at least for a given running program). [But note that some addresses are typically carved out for non-memory.]
● Special Load and Store instructions are used to access memory. You can transfer 1, 2 or 4 bytes in one operation.
● In ARMv4, 4-byte transfer must begin at a memory address that is a multiple of 4: the alignment rule. Similarly, 2-byte transfers must begin at an even address.

Big Endian vs Little Endian

● When a 4-byte word is laid out in memory, does the most-significant byte (big end) come first, or the least-significant byte (little end)?
● A religious war arose between the two camps.
● ARM7TDMI processor can do either, but the default for ARM is usually little-endian.
● The issue is only visible if you write a word/halfword into memory and then try to read it back in smaller pieces (eg bytes).

Example Program

● Compute 10+9+8+7+6+5+4+3+2+1
– Put the constant 0 into R1
– Put the constant 10 into R2
– Add R1 and R2, put the result into R1
– Subtract the constant 1 from R2 and set the status flags
– If the Z flag is not set, reset the PC to contain the address of the 3rd instruction above.
● Each of these instructions can be encoded into machine code, if you are willing to slog through the reference manuals enough.
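To see what the five steps of the example program do, here is a rough Python simulation. It is a sketch only: Python variables stand in for R1, R2 and the Z flag, and the function name run_example is invented for this illustration.

```python
def run_example():
    r1 = 0                                # put the constant 0 into R1
    r2 = 10                               # put the constant 10 into R2
    while True:
        r1 = (r1 + r2) & 0xFFFFFFFF       # add R1 and R2, result into R1
        r2 = (r2 - 1) & 0xFFFFFFFF        # subtract 1 from R2 ...
        z_flag = (r2 == 0)                # ... and set the Z flag
        if z_flag:                        # branch back only while Z is not set
            break
    return r1, r2

print(run_example())   # (55, 0)
```

The loop body runs 10 times, and R1 ends up holding 55.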

Assembly Language
CS2253 Owen Kaser, UNBSJ

● Some insane machine-code programming
● Assembly language as an alternative
● Assembler directives
● Mnemonics for instructions

Machine-Code Programming (or, Why Assemblers Keep Us Sane)

● Compute 10+9+8+7+6+5+4+3+2+1
– Put the constant 0 into R1
– Put the constant 10 into R2
– Add R1 and R2, put the result into R1
– Subtract the constant 1 from R2 and set the status flags
– If the Z flag is not set, reset the PC to contain the address of the 3rd instruction above.
● Let's try to make some machine code.

Put 0 into R1

● There's a Move instruction, or you could subtract a register from itself, or EOR a register with itself, or... let's use Move.
● Book Fig 1.12
● cond = 1110 means unconditional
● S=0 means don't affect status flags
● I=1 means constant; opcode = 1101 for Move
● Rn = ???? say 0000; Rd = 0001 for R1
● bits 8-11: 0000 Rotate RIGHT by 0*2
● bits 0-7: 0x00
● So machine code is
1110 00 1 1101 0 0000 0001 0000 00000000 = 0xE3A01000

Put 10 into R2

● cond = 1110 means unconditional
● S=0 means don't affect status flags
● I=1 means constant; opcode = 1101 for Move
● Rn = ???? say 0000; Rd = 0010 for R2
● bits 8-11: 0000 (rotate right by 2*0); bits 0-7: 0x0A
● So machine code is
1110 00 1 1101 0 0000 0010 0000 00001010 = 0xE3A0200A

Add R1 and R2, put result into R1

● Same basic machine code format as Move
● cond = 1110 for “always” ; I=0 (not constant)
● opcode = 0100 for ADD; S=0 (no flag update)
● Rn = R1, Rd = R1
● shifter_operand = 0x002 for R2 unmolested
● Having fun yet??
● 1110 00 0 0100 0 0001 0001 0000 0000 0010 = 0xE0811002

Subtract 1 from R2, result into R2

● Same basic machine code format as Move
● cond = 1110 for “always” ; I=1 (constant)
● opcode = 0010 for Subtract; S=1 (yes flag update)
● Rn = R2, Rd = R2
● shifter_operand = 0x001 for 1 rotated right 0 positions
● 1110 00 1 0010 1 0010 0010 0000 0000 0001 = 0xE2522001

Maybe Rinse and Repeat

● If the Z flag is not set, we want to go back 2 instructions before this one.
● book Fig 3.2
● cond = 0001 means “when Z flag is not set”
● L=0 means “don't Link” (Link changes R14)
● signed offset should be -4. The PC is already 2 instructions ahead of this one, and we want to go back 2 more than that.
● 0001 101 0 111111111111111111111100 = 0x1AFFFFFC
● Are you REALLY having fun yet ??
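The bit-field packing done by hand above is exactly the chore an assembler automates. Below is a Python sketch that repacks the same fields. The function names and constant names are invented for this illustration; the field positions follow the bit groupings written out on the slides.

```python
def encode_data_processing(cond, i_bit, opcode, s_bit, rn, rd, operand2):
    """Pack the fields of an ARM data-processing instruction into 32 bits."""
    return ((cond << 28) | (i_bit << 25) | (opcode << 21) | (s_bit << 20)
            | (rn << 16) | (rd << 12) | operand2)

def encode_branch(cond, link, offset_words):
    """Pack a branch: bits 27-25 are 101, offset is a signed 24-bit word count."""
    return (cond << 28) | (0b101 << 25) | (link << 24) | (offset_words & 0xFFFFFF)

ALWAYS, NE = 0b1110, 0b0001              # condition codes used on the slides
MOV, ADD, SUB = 0b1101, 0b0100, 0b0010   # opcodes used on the slides

program = [
    encode_data_processing(ALWAYS, 1, MOV, 0, 0, 1, 0x000),  # put 0 into R1
    encode_data_processing(ALWAYS, 1, MOV, 0, 0, 2, 0x00A),  # put 10 into R2
    encode_data_processing(ALWAYS, 0, ADD, 0, 1, 1, 0x002),  # R1 = R1 + R2
    encode_data_processing(ALWAYS, 1, SUB, 1, 2, 2, 0x001),  # R2 = R2 - 1, set flags
    encode_branch(NE, 0, -4),                                # back to the add if Z not set
]

for word in program:
    print("0x{:08X}".format(word))
```

The five printed words should match the five hand-assembled values above.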

How'd you know the cond codes? How'd You Know the Shifter Magic?

An Assembler

● Rather than making you assemble together all the various bit fields that make up a machine instruction, let's make a program do that.

● You are responsible for breaking the problem down into individual instructions, which will be given human friendly names (mnemonics).

● You give these instruction names to the assembler, along with various other directives (aka pseudo-ops) that control how the assembler does its job.

● It is responsible for producing the binary machine code.

● It also produces symbol table information needed by a subsequent linker program, if you write a multi-module program.

Assembly Language

● You communicate with the assembler via assembly language (mix of mnemonics, directives, etc.)
● Assembly language is line-oriented.
● A line consists of
– an optional label in column 1
– an optional instruction or directive (and any arguments)
– an optional comment (after a ; )
● Example:
here    b here             ; create infinite loop.
● “here” is a label that marks a place
● b is a branch instruction, forces the PC to a new location (here).

The Bad News

● Anyone who creates an assembler gets to define their own assembly language (ignoring manufacturer's suggestions). Dialects?
● Textbook shows code for Keil and Code Composer Studio. But we use Crossware's assembler, which is yet another dialect and it's hard to find documentation on it.
● Textbook talks about “Old ARM format” and “UAL format”. Crossware is a mixture (more old).

Our Program in Assembly

mymain  mov r1,#0          ← mymain is the label, mov is the instruction, # precedes the constant
        ; nice comment, eh?
        mov r2,#10         ; put 10 into r2 (bad comment)
myloop  add r1, R1, r2     ← case insensitive for reg names
        subs r2, r2, #1    ← final s means to affect flags
        bne myloop         ← condition is “ne” (z flag false)
sticky  b sticky           ← so we don't fall out of pgm
        end                ← directive to assembler: you're done
        ;don't use “end”; it seems to be buggy in Crossware

Register Names

● r0 to r15 (alias R0 to R15)
● SP or sp, aliases for R13
● LR or lr, aliases for R14
● PC or pc, aliases for R15
● cpsr or CPSR (the status registers etc)
● spsr or SPSR, apsr or APSR (later)
● not s0-s3 or a1-a4 (unlike book page 63)

Popular Assembler Directives

● Textbook Section 4.4 describes the set of directives supported by the Keil assembler and the TI assembler.
● Our Crossware assembler is different than both (but closer to Keil).
● Let's look at directives to
– set aside memory space for variables/arrays
– define a block of code or data
– give a symbolic name to a value

Directive to Set Aside Memory

● The SPACE directive tells the assembler to set aside a specified number of bytes of memory. These locations will be initialized to 0.
● Usually have a label, since you need a name to refer to the allocated memory.
● Example
– myarray SPACE 100
– myarr2 SPACE 100*4 ← constant expressions ok
● Later, instructions can load and store things into the chunks of memory by referring to the names used.
● If myarray starts at address 1234, myarr2 starts at 1234+100

Use of SPACE

● An assembly language programmer uses SPACE for the same reasons that a Java programmer uses an array.

Directives for Memory Variables

● Use DCB to declare an initialized byte variable.
● DCW for initialized halfword, DCD for word.
● Example
myvar1 DCB 50        ← decimal constant
myvar2 DCB 'x'       ← ASCII code of 'x'
myvar3 DCB 0x55 + 3  ← constant expression
● If myvar1 ends up being at address 1234, then myvar2 will be at 1235 and myvar3 at 1236

Alignment

● DCW assumes you want the memory variable to start at a multiple of 2 (“halfword aligned”)
● DCD assumes you want alignment to a multiple of 4.
● To achieve this, assembler will insert padding.
● If you really want to set aside a word without padding, use DCDU. The “U” is for unaligned.
● There's also DCWU.

Alignment Example

Aligned directives:      Unaligned directives:
v1 DCB 10                v1 DCB 10
v2 DCW 20                v2 DCWU 20
v3 DCB 30                v3 DCB 30
v4 DCD 40                v4 DCDU 40

With the aligned directives, if v1 is at address 3000, then
v2 starts at 3002 (1 byte of padding)
v3 is at 3004
v4 starts at 3008 (3 bytes padding)

With the unaligned directives, if v1 is at 3000, then
v2 starts at 3001
v3 is at 3003
v4 starts at 3004 (aligned by luck)
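The padding arithmetic in the alignment example can be sketched in Python. This only illustrates the rule as described on the slides (DCW aligns to 2, DCD to 4, the U variants do not align); the function name layout and its tables are assumptions, not how any real assembler is written.

```python
SIZE = {"DCB": 1, "DCW": 2, "DCD": 4, "DCWU": 2, "DCDU": 4}
ALIGN = {"DCB": 1, "DCW": 2, "DCD": 4, "DCWU": 1, "DCDU": 1}

def layout(directives, start):
    """Assign an address to each (label, directive) pair, inserting padding."""
    addresses = {}
    addr = start
    for label, directive in directives:
        a = ALIGN[directive]
        addr = (addr + a - 1) // a * a    # round up to the next multiple of a
        addresses[label] = addr
        addr += SIZE[directive]
    return addresses

aligned = layout([("v1", "DCB"), ("v2", "DCW"), ("v3", "DCB"), ("v4", "DCD")], 3000)
unaligned = layout([("v1", "DCB"), ("v2", "DCWU"), ("v3", "DCB"), ("v4", "DCDU")], 3000)
print(aligned)     # {'v1': 3000, 'v2': 3002, 'v3': 3004, 'v4': 3008}
print(unaligned)   # {'v1': 3000, 'v2': 3001, 'v3': 3003, 'v4': 3004}
```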

More Alignment Control

● Keil assembler has an ALIGN directive that can force alignment to the next word boundary (inserting 0-3 bytes of padding).
● In Crossware, the directive takes a numeric argument. So ALIGN 4 (or ALIGN 8)

DCB with Several Values

● You can use DCB with several comma-separated values
● Several consecutive memory locations are set aside. A label names the first of them.
● Example: foo DCB 1,2,3,4
● We can access the location initialized to 3 as “foo+2”
● A quoted string is equivalent to a comma separated list of ASCII values.
DCB “XY” is same as DCB 'X','Y' or DCB 88,89
● DCW and DCD can also take a comma-separated list.
● Common use: make a small initialized table.

DCB: Signed or Unsigned?

● DCB's argument must be in the range -128 to +255.
● -ve values are 2's complement
● +ve values are treated as unsigned
● So DCB -1, 255 is same as DCB 255, 255
● Similarly DCW's arguments in range -32768 to +65535.
● DCD from -2^31 to +2^32 - 1

AREA directive

● In general, an assembly language program can have several blocks of data and several blocks of code. And it can be written in several different source-code files.
● The AREA directive marks the beginning of a new block. You give it a new name and specify its type.
– eg AREA fred,code
– You can go back to a previous area by using an old name
● A tool called a linker runs after the assembler to put your various sections (and any library routines you need) into a single program.
● Much more on linkers later in the course

AREA Example

        AREA mycode,code
foo     add R1, R2, R3
        add R4, R5, #10
        AREA mydata, data
var1    dcb “cs2253”
        AREA mycode        ← continues mycode where it left off
        add R6, R7, R8

This feature allows for us to show our data declarations near the code that uses them (maybe good software engineering), even if the different sections end up being far apart in memory. Memory picture on board...

Code in Data, Data in Code

● Q: Is this allowed; if so, what does it do?
           AREA mycode, CODE
starthere  add R1, R2, R3
           DCD 0x1234567        ; this line is fishy
           add R2, R3, R4
           AREA mydata, DATA
var1       DCD 1234
var2       add R2, R3, R4       ; this line is also fishy
var3       DCB “hello world”,0

Operators in expressions

add R4, R5, #10 ↔ add R4, R5, #3+3+3*1+1
● Both of the above generate the same single machine-code instruction.
● The + and * operators are just requests to the assembler to do a little bit of math when it processes the line. No runtime effect.
● Other operators supported by Crossware are & and | (bitwise AND and OR). Also >> and <<.
● I can't find XOR, mod (unlike Keil and CCS on page 75)

EQU: Give a Symbolic Name

● The EQU directive is used to give a symbolic name to an expression. Use it to make code easier for humans.
● Example
fred         DCB 20, 200, “Frederick Wu”
fred_age     EQU fred+0
fred_height  EQU fred+1
fred_name    EQU fred+2
Subsequent instructions can load data from fred_height rather than the more cryptic fred+1. But to the assembler, both loads will be equivalent.

Directives Crossware May Lack

● Compared to Keil and CCS, our Crossware assembler does not appear to support some directives. I can't find good documentation, so maybe they exist under a different name :(
– ENTRY
– RN
– LTORG, though we do have the “LDR rx,=” construct (eg textbook page 72)
– SETS
● Also, the SECTION directive only takes attributes CODE and DATA. Not the others in textbook Table 4.3.
● Crossware does support macros and conditional assembly, advanced topics for later in the course.

A Few Instructions

● Assembler directives are great, but the main thing in assembly language is to specify instructions (and then get the assembler to generate the associated machine codes)
● So far (from the loop example) we know
– add
– sub
– b
– mov

A Few More Instructions (Table 4.1)

● These are math-ish instructions:
– RSB – reverse subtract
– ADC, SBC – add/subtract with carry
– RSC – reverse subtract with carry
– MVN – move “negative” (a bitwise NOT)
– AND, ORR, EOR, BIC – bitwise logical operations
– MUL, SMULL, UMULL – various * ops
– MLA, SMLAL, UMLAL – multiply/accumulate.

Mnemonics

● A mnemonic is “a memory aid”.
● It’s hard to remember the bit pattern associated with a machine operation.
● As a memory aid, we have human-friendly names like ADD, SUB etc.
● They are our mnemonics.

From Reference

Example: Swapping

● Java swap of v1 and v2:
temp = v1; v1 = v2; v2 = temp;
● Naive ARM swap of r1 and r2
mov r3, r1
mov r1, r2
mov r2, r3
● Clever swap avoids trashing r3 (book p 53):
eor r1, r1, r2
eor r2, r1, r2
eor r1, r1, r2
● Book “Hacker's Delight” is full of this kind of trick.

Example: 64-Bit Addition

● Assume r1 contains the high 32 bits of value X and r2 contains the low 32 bits
● Assume r3 contains the high 32 bits of Y and r4 contains the low 32 bits.
● Want result in r5 (high bits) and r6 (low bits)
ADDS r6, r2, r4 ; add low words [affect flags]
ADC r5, r1, r3 ; add high words

Computing Your Grade

● Test was out of 80. Prof told you how many points you lost (put the number into R1). Figure out what your grade out of 80 was:
RSB R2, R1, #80
● Now your grade is in R2.
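The ADDS/ADC pair from the 64-bit addition example can be imitated in Python to see where the carry goes. This is a sketch under the register assignment given on the slide; the function name add64 is invented for illustration.

```python
MASK32 = 0xFFFFFFFF

def add64(x_hi, x_lo, y_hi, y_lo):
    """Add two 64-bit values held as (high, low) 32-bit halves."""
    low_sum = x_lo + y_lo
    carry = low_sum >> 32                    # ADDS r6, r2, r4 sets the carry flag
    r6 = low_sum & MASK32                    # low word of the result
    r5 = (x_hi + y_hi + carry) & MASK32      # ADC r5, r1, r3 adds the carry in
    return r5, r6

print(add64(0, 0xFFFFFFFF, 0, 1))   # (1, 0)
```

Without the carry feeding into the high-word addition, the example would wrongly give (0, 0).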

Constant Operands

● Most instructions have register values or constants as the operands
● (Exception: Load and store instructions – later)
● All 8-bit constants are okay
● As are all constants of the form RotateRight(v, 2*amt) where v is an 8-bit value and amt from 0 to 15.
● So 0xAB is ok
– so is 0xAB0 (0xAB with a 28 bit rotate right)
– so is 0xB000000A (0xAB with a 4-bit rotate right)

Why This Weirdness

● Studies show that most constants are small.
● Among larger constants, bit-masks containing a small chunk of mixed bits are common (surrounded by zeros)
● Similar bitmasks that are mostly 1s can be handled by using the MVN instruction
● A RISC architecture with 32-bit instructions isn't long enough to encode an arbitrary 32-bit constant. So just allow the most common ones.
● Assembler complains if you use a constant that cannot fit this weirdness.

Machine Instruction With Constant

The Barrel Shifter's Place
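The rule from the Constant Operands slide, RotateRight(v, 2*amt) with v an 8-bit value and amt from 0 to 15, can be checked mechanically. The Python sketch below tries all 16 rotations; the names rotate_right and find_encoding are made up for illustration.

```python
def rotate_right(value, amount):
    """Rotate a 32-bit value right by amount bit positions."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF

def find_encoding(constant):
    """Return (v, amt) with constant == RotateRight(v, 2*amt), or None."""
    for amt in range(16):
        # Undo a rotate right by 2*amt by rotating right the remaining distance.
        v = rotate_right(constant, (32 - 2 * amt) % 32)
        if v <= 0xFF:
            return v, amt
    return None

print(find_encoding(0xAB))         # (171, 0)
print(find_encoding(0xAB0))        # (171, 14), a 28-bit rotate right
print(find_encoding(0xB000000A))   # (171, 2), a 4-bit rotate right
print(find_encoding(0x12345678))   # None: the assembler would complain
```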

Shifted Register Operands

● If the second operand is a register value, the barrel shifter can modify it as it travels down the B bus.
● Barrel shifter is capable of
– LSL (logical shift left)
– LSR (logical shift right)
– ASR (arithmetic shift right)
– ROR (rotate right)
– RRX (33 bit ROR using carry between MSB and LSB)
● No modification desired? Shift by 0 positions!
● Carry flag is involved (but the new carry value is not necessarily written into the status register)

ARM Shifts and Rotates

How Much Shifting

● With RRX, it appears the register can only be shifted by one position.
● With others, you can shift 0 to 31 positions
– Either as a constant (“immediate”)
– Or by the least significant 5 bits of a register
● There are separate machine code formats for these cases.
– Bit 4 distinguishes the cases
– Bits 5 & 6 say what kind of shift/rotate
– Bits 11 to 7 involve which register, or the constant

Machine Encoding (from Ref Man)

● Below, shift field is 00 for LSL, 01 for LSR, 10 for ASR, 11 for ROR. RRX also 11 with count of 0 (and rotates only one position).

Example

● Machine code to take R1, logical left shift it by 3 positions, result in R2
● Assembly language: MOV R2, R1, LSL #3
● It’s the “immediate shift” format:
– Bits 27, 26, 25 and 4 are all 0
● Bits 11 to 7 are 00011 (for the #3)
● Bits 3 to 0 are 0001 (since R1 is being shifted)
● Bits 5 & 6 are 00 to select the LSL kind of shift
● Unconditional, bits 31 to 28 are 1110; MOV opcode 1101
● So: 1110 00 0 1101 0 ???? 0010 00011 00 0 0001 = 0xE1A02181 (taking ???? as 0000)

Setting Conditions

● Any of the data-processing instructions so far can optionally affect the flags.
● At the machine-code level, bit 20 (called S) controls this: S=1 means to set the flags
● In assembly language, you append an S on the mnemonic. ADDS instead of ADD
● Also, there are some instructions whose sole purpose is to set flags: they don’t change any of R0 to R15.
● Compare (CMP, CMN) and Test (TST, TEQ) instructions.
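The bit positions listed for the MOV R2, R1, LSL #3 example can be packed in Python to confirm the final hex value. This is only a sketch of the “immediate shift” format as laid out on the slide; the function name encode_immediate_shift is hypothetical, and the unused Rn field is assumed to be 0000.

```python
def encode_immediate_shift(cond, opcode, s_bit, rn, rd, shift_amount, shift_type, rm):
    """Pack a data-processing instruction whose operand is a register shifted by a constant."""
    return ((cond << 28)              # bits 31-28: condition
            | (opcode << 21)          # bits 24-21: opcode (bits 27-25 stay 0)
            | (s_bit << 20)           # bit 20: S
            | (rn << 16) | (rd << 12)
            | (shift_amount << 7)     # bits 11-7: constant shift amount
            | (shift_type << 5)       # bits 6-5: 00 = LSL
            | rm)                     # bits 3-0: register being shifted (bit 4 stays 0)

word = encode_immediate_shift(0b1110, 0b1101, 0, 0, 2, 3, 0b00, 1)
print("0x{:08X}".format(word))   # 0xE1A02181
```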

Sum to a Limit

● Let’s add 1+2+3+… until sum exceeds R4 (unsigned)
    MOV R1,#0          ; The sum
    MOV R2,#1
LP  ADD R1, R1, R2
    ADD R2, R2, #1
    CMP R1, R4         ; computes R1 - R4, sets flags
    BLS LP             ; LS = unsigned Lower or Same (CF=0 or Z=1)
                       ; use LE for signed Lesser or Equal

Multiplication

● The ARM v4 ISA has 6 multiplication instructions.
● Does not include “multiply by a constant”
● Why several?
– Should product be 32 bits or 64 bits?
– Are the input values considered signed?

32-Bit Products

● Fact: Since the product stored is the low-order 32 bits of the true product, signed and unsigned variations would give same result. So not separate instructions.
● MUL instruction: Two registers' values multiplied, low-order 32 bits stored in destination register.
● MLA (multiply and accumulate). The low order 32-bits of the product are added to a 3rd register and stored in a 4th register.
● Eg: MLA R4, R1, R2, R3 ; R4 = R1*R2 + R3

64-Bit Product (Long Multiply)

● Results are stored in a pair of registers.
● The “accumulate” version has the product added onto the 64-bit value in a pair of registers.
● SMULL – signed long multiply
● UMULL – unsigned long multiply
● UMLAL - unsigned long multiply accumulate
● SMLAL – signed long multiply accumulate
● Ex: UMLAL R1, R2, R3, R4 means (R1, R2) ← (R1, R2) + R3*R4 with unsigned math
– Above, R1 is the least significant 32 bits
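The “Fact” on the 32-Bit Products slide (signed and unsigned multiplication agree in the low-order 32 bits) is easy to spot-check. The Python sketch below is an illustration only; to_signed and mul_low32 are invented helper names.

```python
MASK32 = 0xFFFFFFFF

def to_signed(bits):
    """Interpret a 32-bit pattern as a two's complement signed value."""
    return bits - (1 << 32) if bits & 0x80000000 else bits

def mul_low32(a_bits, b_bits, signed):
    """Multiply two 32-bit patterns and keep the low-order 32 bits, as MUL does."""
    if signed:
        product = to_signed(a_bits) * to_signed(b_bits)
    else:
        product = a_bits * b_bits
    return product & MASK32

a, b = 0xFFFFFFFB, 3     # 0xFFFFFFFB is -5 if signed, about 4.29 billion if unsigned
print(hex(mul_low32(a, b, signed=True)))    # 0xfffffff1
print(hex(mul_low32(a, b, signed=False)))   # 0xfffffff1
```

The full 64-bit products differ, which is why the long multiplies do need separate signed and unsigned versions.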

ARM Memory
Owen Kaser, CS2253

Overview

● Loads and Stores
● Memory Maps
● Register-Indirect Addressing
● Post- and Pre-indexed Addressing

Mostly corresponds to book Chapter 5.

16 Registers is Not Enough

● So far, the only places discussed for data are the ARM's CPU registers
● Most interesting programs need more data.
● We need memory outside the CPU for our bulk data storage.
● Also, memory can contain pre-computed tables (eg, of trig functions) that are never altered
● For your toaster's software, the machine code can be set at the factory. Fancy toaster: you can “flash” your toaster with improved software.

Loads and Stores

● Recall that ARM is a “load/store” architecture. Cannot directly do calculations on values in memory. Have to load them into a CPU register to use them as inputs.
● Similarly, calculations put results into registers. Then you can use a store instruction to put them into memory.
● Loads and stores need to specify where in memory things should go. This will be a numeric “memory address”.
● (Memory) addressing modes are small built-in calculations the CPU can do, to compute the memory address.
● Simple case: value in, say, R3 is to be used as the address.

System Memory Maps

● A system built around an ARM7TDMI processor uses 32-bit values as memory addresses. Each address would correspond to a byte (oops, octet).
● The overall “memory address space” ranges from 0 to 0xFFFFFFFF.
● But the overall memory address space is further subdivided (boundaries are often small multiples of powers of 2)
● RAM, ROM, flash, and I/O devices can be given their own subdivisions.
● More on I/O devices later in the course. For now, just realize that some memory addresses accept stores, and some ignore them.

Ex. Memory Map (extracts from book Table 5.1)

Start        End          Description
0x00000000   0x0003FFFF   On-chip flash
0x00040000   0x00FFFFFF   reserved
0x01000000   0x1FFFFFFF   ROM
0x20000000   0x20007FFF   (Static) RAM
…..
0x4000C000   0x4000CFFF   UART 0 (a “serial port”) device
…..
0xE0001000   0xE0001FFF   “data watchpoint and trace” (DWT) facility
….
0xE0004000   0xFFFFFFFF   reserved

For Simplicity....

● Let's only mess with addresses in a range that corresponds to RAM memory.
● Then, loads and stores both make sense.

Register-Indirect Addressing Mode

● Let's suppose you want to load the byte at address 0x00005000 into register R3.
● 8 bit value into a 32-bit container. If we want the 8-bit value to be zero-extended, use LDRB instruction.
● If you want it sign-extended, use LDRSB.
● Simplest case: a register stores the address of some data you care about. Let's go for R1.
● Assembler:
MOV R1, #0x00005000  ; address to R1
LDRB R3, [R1]        ; memory value to R3

Looping Through Memory

● Let's suppose you want to wipe clear (to 0) the contents of all memory locations from 0x00005000 to 0x00005FFF.
● A loop will work nicely.
    MOV R1, #0x00005000 ; starting location
    MOV R2, #0x00006000 ; when to stop
    MOV R3, #0
LP  STRB R3, [R1]       ; wipe clear current location's value
    ADD R1, R1, #1      ; advance to next location
    TEQ R1, R2          ; has R1 hit the stopping location?
    BNE LP

Speeding It Up

● If the area to be cleared is properly aligned (starts on a multiple of 4) and is the right size (a multiple of 4) we can clear out 4 consecutive addresses with one STR (store word) instruction.
● Recall that a 32-bit word is stored across 4 addresses: A, A+1, A+2, A+3.

…. Faster Code

    MOV R1, #0x00005000 ; starting location
    MOV R2, #0x00006000 ; when to stop
    MOV R3, #0          ; 4 bytes of zeros
LP  STR R3, [R1]        ; wipe clear current location's value AND the next 3 locations' values
    ADD R1, R1, #4      ; advance to location of next group of 4 bytes
    TEQ R1, R2          ; has R1 hit the stopping location?
    BNE LP
● Loop runs only ¼ as many times now.

Even Faster

● The pattern of “use a register to provide a memory address, then update the register in preparation for the next loop” is extremely common.
● ARM designers created an addressing mode that does BOTH of these operations in a single instruction. “post-indexed”
● STR R3, [R1], #4 is equivalent to
STR R3, [R1]
ADD R1, R1, #4

Textbook Figure 5.2

Even Faster Code

    MOV R1, #0x00005000 ; starting location
    MOV R2, #0x00006000 ; when to stop
    MOV R3, #0          ; 4 bytes of zeros
LP  STR R3, [R1], #4    ; wipe 4, then advance “pointer” R1
    TEQ R1, R2          ; has R1 hit the stopping location?
    BNE LP
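To see what the post-indexed loop does, here is a Python sketch that imitates it on a bytearray standing in for RAM. Each pass stores one word of zeros and then advances the pointer by 4, which is all the post-indexed STR needs (no separate ADD). The function name clear_words and the 0x7000-byte pretend memory are assumptions for illustration.

```python
def clear_words(memory, start, stop):
    """Zero memory[start:stop] one 4-byte word at a time; return the loop count."""
    r1 = start
    iterations = 0
    while True:
        memory[r1:r1 + 4] = b"\x00\x00\x00\x00"   # STR R3, [R1], #4: store the word ...
        r1 += 4                                    # ... then post-increment R1
        iterations += 1
        if r1 == stop:                             # TEQ R1, R2 / BNE LP
            break
    return iterations

memory = bytearray(b"\xFF" * 0x7000)               # pretend RAM, initially all 0xFF
count = clear_words(memory, 0x5000, 0x6000)
print(count)                                       # 1024
```

The byte-at-a-time loop would run 4096 times over the same range, so this one runs ¼ as many times.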

Java Pre- vs Post-Increment

● Can draw a parallel to Java's ++ operators.
● Recall, v = M[p++] in Java (post-increment)
– it uses the current version of p to index M
– then it increments p.
● Versus v = M[++p] in Java (pre-increment)
– it first increments p
– then the new value of p is used to index into M

Post-Indexed Addressing

● In ARM, post-indexed addressing takes a base register. (Should not be R15.)
● Uses that base register's value to go to memory
● Then updates the base register's value by a little computation
– adding/subtracting a constant (earlier example)
– adding/subtracting a register
● which is allowed to be modified by the barrel shifter
● can be shifted/rotated by a constant amount
● can be shifted/rotated by a register amount
● Usefulness of fanciest of these seems doubtful
● LDR R1, [R2], ROR R3 ; is this useful???

Useful? Example

● Java, for an int array M, variable x:
j = 0;
while (….) {
  sum += M[j];
  j += x;}
● ARM: suppose x in R2, start of M in R1
● In loop body: LDR R3, [R1], R2, LSL #2

Pre-Indexed Addressing

● There are two flavours of pre-indexed addressing. Both do a little computation and use the computed effective address to go to memory. In one, the base register is updated. Other flavour does not update.
● In assembly language, the ! symbol means to update the base register. Don't use R15 as the base register with !
● Ok to use R15, without ! The value of R15 is 8 bytes beyond the start of the current machine code. [Details of why are a bit advanced.]

Rationale for the “little computations”

● PC-relative addressing for constants
● Getting a field of an object, given the start of the object.
● Indexing into array of objects, selecting a field (if the object size is a power of two)
● (Selected largely by analyzing what compilers for HLLs would find useful, I think...rather than focussing on assembly language programmers)

Pre-indexed Figure (Textbook)

● Instruction is STR r0, [r1, #12]
● Add ! to update r1 when finished:
STR r0, [r1, #12]! ; r1 ← 0x20c

Some Pre-indexed Examples

● MOV R1, #0x12345678 fails. Constant is not a rotation of an 8 bit value.
● Instead, initialize a memory location with your constant. Then use PC-relative addressing to load it.
● LDR R1, myConst ; pseudo-op
… 1000 bytes later...
myConst DCD 0x12345678
● The LDR instruction is actually something like
LDR R1, [PC, #996] ; PC was already 8 ahead
● 996 is close enough to PC. Must be within 4 kiB.

Ex: Field Access for an Object

● In HLLs, the fields of an object occupy consecutive memory addresses (possibly with padding)
● Let's suppose that an object starts at 1000. There are two 32-bit fields, then a 16-bit halfword field that we want to load into R2.
● Let's suppose that R1 contains the starting address of the object.
● Use LDRH R2, [R1, #8] ; immediate offset is 8
(Desired field starts 8 bytes later: gotta skip over first two words.)
● (Minor point: LDRH requires the offset to be within ±255)

Ex: Array Access

● Suppose R1 contains the starting address of an array.
● Suppose the array's elements are 4 bytes each
● To load the wth array element, we want address R1 + 4*w
● Suppose value w is in R2
● LDR R5, [R1, R2, LSL #2] loads desired value.

No ADR Pseudo-op

● The Crossware assembler does not seem to support ADR, which is used to put an address into a register (that you will then use as a base register). For instance, summing values in array…
    MOV R0, #0            ; accumulate answer
    ADR R1, MyArr         ; Keil pseudo-op
    ADR R2, AfterMyArr    ; past last valid address
LP  LDR R3, [R1], #4
    ADD R0, R0, R3
    TEQ R1, R2
    BNE LP
    …..
MyArr       DCD 34, 23, 56, 78, 12345566, ……...
AfterMyArr  DCB 0

Instead of ADR

● Instead of ADR, you should be able to do the following:
    MOV R0, #0            ; accumulate answer
    LDR R1, =MyArr
    LDR R2, =AfterMyArr   ; past last valid address
LP  LDR R3, [R1], #4
    ADD R0, R0, R3
    TEQ R1, R2
    BNE LP
    …..
MyArr       DCD 34, 23, 56, 78, 12345566, ……...
AfterMyArr  DCD 0         ; wasted word, could avoid...

LDR As Pseudoinstruction

● LDR Rx, =value works for any 32-bit value (address or constant).
● It sets aside space in a “constant pool”, preinitialized to value. This constant pool is (by default) at the end of the current AREA.
● Then it generates machine code for a PC-relative LDR into Rx from this preinitialized location.
● Like a convenient DCD and LDR Rx, [PC, #something]
● See textbook Chapter 6.
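The offset in a PC-relative load like LDR R1, [PC, #996] comes from simple arithmetic: the PC reads as the instruction's own address plus 8. The Python sketch below shows that calculation. The function name, the example address 0x8000, and the reading of “1000 bytes later” as 1000 bytes past the end of the 4-byte LDR are all assumptions made for illustration.

```python
def pc_relative_offset(instruction_address, target_address):
    """Offset to put in LDR Rx, [PC, #offset] so that it reaches target_address."""
    pc = instruction_address + 8          # PC is already 8 bytes ahead
    offset = target_address - pc
    if not -4095 <= offset <= 4095:       # 12-bit offset magnitude: within 4 kiB
        raise ValueError("constant is too far from the LDR instruction")
    return offset

ldr_address = 0x8000                      # hypothetical address of the LDR
const_address = ldr_address + 4 + 1000    # constant 1000 bytes after the LDR ends
print(pc_relative_offset(ldr_address, const_address))   # 996
```

This is the bookkeeping that the LDR Rx, =value construct does for you when it places a value in the constant pool.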

Machine-Code Formats LDR/STR/LDRB/STRB

● From reference manual:

Meaning of Some Bits (Ref Man)

Exercise/Example

● Determine machine code for
LDR R3, [R1], #4
and also
STRB R3, [R1, R2, LSR #5]!

Load and Store Multiple

● There are instructions LDM and STM that load or store a number of registers.
● With LDM, a bit vector in the machine code indicates which registers to load. They are loaded from consecutive addresses.
● STM works similarly
● They are especially useful in storing things on the runtime stack, and will be looked at when we cover that topic.

Control Structures
CS2253, Owen Kaser

● Implementing familiar HLL control structures:
– if-then
– if-then-else
– while
– do..while
● Omit: switch
● See textbook Chapter 8

Basic Mechanism

● Essentially, to disrupt the flow of control you need to set PC (alias R15) to a new value.
● The b instruction does this
● But so does any other allowable instruction that writes to R15!
● Consider this instruction:
add R15, R15, R3, LSL #2
Number of instructions skipped ahead depends on R3.

Nesting

● A typical HLL program has nested control structures: if inside of an if, inside of while...
● We'll look at how to replace a HLL control structure (that might have another control structure within it) by corresponding assembly language.
● The inner control structure can be replaced similarly.
● In the following templates, the first use of newlabel1 … newlabel9 means to generate and use a label that was not already in use. Any subsequent occurrence of, say, newlabel1 means to use that same label.

If Without Else

● Replace if (<condition>) { <body> } by
        code to test the condition (often using CMP)
        b<opposite of condition> newlabel1
        code for body
newlabel1

Example

● a1 is in R1, a2 is in R2
● Translate if (a1 >= a2) { a1++;}
        cmp R1, R2
        blt xyz0001         ; lt is opposite to >=
        add R1, R1, #1      ; translation of a1++
xyz0001                     ; my new label

ARM Optimization

● If the body doesn't have nested control statements or other statements that set the flags, can have the following
        code to test the condition
        code for the body, with every instruction conditional.
  Eg
        cmp   R1, R2
        addge R1, R1, #1    ; add made conditional on >=

If With Else

Replace
if (<condition>) { <body1> } else { <body2> } with
        code to test condition
        b<opposite of condition> newlabel1
        code for body1
        b newlabel2
newlabel1
        code for body2
newlabel2

Example

● if (a1 >= a2) a1++; else a2++;
● Following the template:
        cmp R1, R2          ; a1 >= a2 ??
        blt xyz001
        add R1, R1, #1
        b   xyz002          ; don't fall into else code
xyz001
        add R2, R2, #1      ; the else's body
xyz002

ARM Optimization

● Since the bodies are simple, can use predicated [i.e., conditional] instructions:
        cmp   R1, R2
        addge R1, R1, #1    ; the “then” body
        addlt R2, R2, #1    ; the “else” body
● Look Ma, no labels and no branching. No “branch penalty”.

While Statement

● Recall that a while statement checks the condition before every iteration, including the first.
● while (<condition>) { <body> } can turn into
        b newlabel1
newlabel2
        code for <body>
newlabel1
        code for <condition>
        b<condition true> newlabel2
  Other translations are possible, but this is the book's

Example

● for (i=0; i<j; i+=2) { k++; }
        mov R1, #0          ; say R1 stores i
        b   xyz001
xyz002
        add R3, R3, #1      ; body: say R3 has k
        add R1, R1, #2      ; code for i+=2
xyz001
        cmp R1, R2          ; say R2 has j
        blt xyz002
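The while-loop template (jump to the test first, keep the body above the test, branch back while the condition holds) can be mimicked in C with goto. This is a sketch only: the function name `count_even_steps` is made up, and the comments name the ARM instructions each C line stands in for.

```c
#include <assert.h>
#include <limits.h>

/* for (i = 0; i < j; i += 2) k++;  written in the shape of the template.
   Returns the final k, starting from k = 0. */
int count_even_steps(int j)
{
    int i, k = 0;
    assert(j < INT_MAX);      /* keeps i += 2 from overflowing */
    i = 0;                    /* mov R1, #0 */
    goto test;                /* b   xyz001 */
body:                         /* xyz002 */
    k = k + 1;                /* add R3, R3, #1 */
    i = i + 2;                /* add R1, R1, #2 */
test:                         /* xyz001 */
    if (i < j) goto body;     /* cmp R1, R2 then blt xyz002 */
    return k;
}
```

Because the test sits at the bottom, each iteration executes only one branch; the price is the single extra jump on entry.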

Counting Down To Zero

● If you can arrange for your for loops to count down from N to zero AND if it is guaranteed to do at least one iteration, better to use code like
        mov  R1, #N         ; counting down with R1
newlabel1
        code for the body of the loop
        subs R1, R1, #1     ; set the flags
        bne  newlabel1

Do...While Statement

Translate do { <body> } while (<condition>); as
newlabel1
        code for <body>
        code to check condition
        b<condition true> newlabel1
● Slightly simpler than the while loop

Nesting

● Let's do Euclid's algo together:
        while (a != b)
            if (a>b) a=a-b;
            else b=b-a;

Conditional Execution

● Using conditional execution, we can reduce Euclid's code to
GCD     CMP   R0, R1
        SUBGT R0, R0, R1
        SUBLT R1, R1, R0
        BNE   GCD
● Book also shows how to use conditional execution to handle something like if (x==1 || x==5) ++x

Assemblers and Linkers
CS 2253
Owen Kaser, UNBSJ

Contents

● Review of assembler tasks
● A look at linker tasks
● Assembler implementation
● The location counter and symbol table
● Two-pass assembler
● Macros and conditional compilation

Review of Assemblers

● An assembler takes commands and translates them into what will be the contents of some areas.
● Assembler commands can be
  – directives, such as
    ● AREA foo, data [change the area being generated]
    ● DCB “hello” [generate some byte contents in current area]
  – instructions
    ● ADD R1,R2,R3 [generate machine code bits in current area]
  – labels
    ● blah …. [record the current position in the current area as “blah”]

Linkers

● The assembler typically generates one “object code” (.OBJ) file, containing the contents of the various areas.
● One source code file → one object code file.
● Libraries are also object code files.
● Linker's overall job is to put together the various areas in all the object files, getting an executable file that is ready to load into memory and run.

Relocation

● Consider the following situation in area AAA
foo     DCD foo             ; say this is 40 bytes into AAA
        … 100 instructions later
        LDR R1,=foo
● The machine code for the LDR is really for LDR R1, [PC,#-408] “relocatable”
● But the content of the variable foo is supposed to depend on where foo ends up in memory. The assembler does not know this. It just knows that foo will be 40 bytes into its area.
● Only the linker knows where foo will be located. At link time, say that AAA starts at 3000. The linker will fill in 3040 as part of its relocation.
● The assembler had generated a “fix me” note in the .OBJ file, recording something like “fixme: start of AAA + 40” for the linker.

Externals

● A related job for the assembler is to handle cases where one source-code file referred to something that is defined in another source-code file.
● At the assembly-language level, when you intend to use something defined in a different source code file, you declare that thing to be external.
● When you define something that you want to be used in another file, you declare it global.
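The “fixme: start of AAA + 40” note from the Relocation slide can be pictured as a small record that the linker applies once it has chosen where the area starts. The sketch below is an assumption-laden toy: the `Fixup` struct and `relocate` function are invented for illustration, and real object-file formats record considerably more.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "fixme" record: the word at byte offset `site` of the
   area must end up holding (start of area + addend). */
typedef struct {
    size_t   site;    /* where in the area the word to patch lives */
    uint32_t addend;  /* offset of the referenced label in the area */
} Fixup;

/* Apply every fixup once the linker knows the area's start address. */
void relocate(uint32_t *area_words, size_t n_words, uint32_t area_base,
              const Fixup *fix, size_t n_fix)
{
    for (size_t i = 0; i < n_fix; i++) {
        /* the patched word must be aligned and inside the area */
        assert(fix[i].site % 4 == 0 && fix[i].site / 4 < n_words);
        area_words[fix[i].site / 4] = area_base + fix[i].addend;
    }
}
```

For the slide's situation, the word for foo sits 40 bytes into AAA and refers to foo itself, so the record is {40, 40}; with AAA starting at 3000 the patched word becomes 3040.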

Crossware's Example

        include xstdsys.h
        extern _main,__initiostreams,__init_cvars,__HP
        global __cstart,_exit
;*****************************************************************************************
; These sections are required by the Crossware C Compiler:
        area __STACK,4,data,high    ; Linker places this at highest available ram location
        space 1
        org __LowestRomLocation     ← new directive for you
        global __START
__START                             * give the linker the start address
        dcdu __STACK                ; Initialise supervisor stack for C compiler
        dcdu __cstart+1             ; Jump to __cstart on power up
        …...

Assembler Implementation

● Key data structures:
  – a “location counter”. (Some assemblers let you use its value as a constant, and they call it $)
  – an array of area information
    ● a saved location counter
    ● a buffer of all code/data generated so far into that area
  – a symbol table, mapping labels to their addresses.
    ● address: probably an offset within a specified area
    ● symbol table may also record a type, or whether the entry is global or external, etc.

Assembler : Rough Sketch

Area ← default; $ ← 0; buffer ← empty buffer
Repeat
  get a line of text, parse it and discard any comments
  if line has label L, then SymTab.put(L, Area, $)
  if line has directive AREA nm, type (where nm is new)
     Areas.put(Area, $, buffer); Area ← nm; $ ← 0;
  else if line has directive DCB <expr>
     x = evaluate_constant_expression(<expr>, SymTab)
     buffer.add_byte(x); $ ← $+1;
  else if line has instruction ADD rX, rY, rZ
     x = figure out machine code(“ADD”, rX, rY, rZ);
     buffer.add_word(x); $ ← $+4;
  else if line has instruction B <label>
     if <label> is a label in SymTab whose area matches Area
        distance = (SymTab.getValue(<label>) – ($+8))/4;
        x = figure out machine code(“B”, distance)
        buffer.add_word(x); $ ← $+4
     else
        ???
  else if line has …..
Until line has directive END

Problem 1: External References

● What if the assembler processes a line like
        B foo
  where foo is a label in a different source code file?
● Solution:
        buffer.add(figure_out_machine_code(B, 0))
        buffer.add(<fixme note for foo>)
        $ ← $ + 4   // even if fixme note is large
● The final object code file will have a linked list of the various “fixme” places.
● At link time, the linker will know where “foo” will really be and it will replace the offset of 0 by the actual distance.

Problem 2: Forward References

● Contrast this:
foo     add r1, r2, r3
        bne foo             (foo is in the SymTab already)
● to this:
        add r1, r2, r3
        bne foo             (foo may not be in the SymTab yet)
        add r4, r5, r6
foo     add r7, r7, r7

2-Pass Assemblers

● Fairly easy solution to the forward-reference problem: Process the source-code twice.
● First pass: pretend to generate code for the areas, but when something (ie, a forward reference) is unknown, just stick some padding (of the appropriate size) into the buffer. But create the symbol table.
● Second pass: run through the code again, but using the symbol table made in the first pass to generate the correct code for forward references.
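The two-pass idea can be sketched on a toy "source file" in which every line occupies 4 bytes, may carry a label, and may be a branch to a label. Everything here (`Line`, `Sym`, `lookup`, `assemble`) is invented for illustration; a real assembler also parses text, handles areas, and emits machine code rather than bare offsets.

```c
#include <assert.h>
#include <string.h>

#define MAX_SYMS 32

/* One toy source line: optional label, optional branch target
   (NULL branch_to means "some other 4-byte instruction"). */
typedef struct { const char *label; const char *branch_to; } Line;

typedef struct { const char *name; int addr; } Sym;

static int lookup(const Sym *tab, int n, const char *name)
{
    for (int i = 0; i < n; i++)
        if (strcmp(tab[i].name, name) == 0) return tab[i].addr;
    return -1;   /* not in this file: an external "fixme" would be needed */
}

/* Fills offsets[i] with the branch word-offset for branch lines and 0
   for other lines. Returns 0 on success, -1 on an undefined label. */
int assemble(const Line *src, int n, int *offsets)
{
    Sym tab[MAX_SYMS];
    int nsyms = 0;

    /* Pass 1: only advance the location counter and record labels. */
    for (int i = 0, lc = 0; i < n; i++, lc += 4) {
        if (src[i].label) {
            assert(nsyms < MAX_SYMS);
            tab[nsyms].name = src[i].label;
            tab[nsyms].addr = lc;
            nsyms++;
        }
    }

    /* Pass 2: the table is complete, so forward references resolve. */
    for (int i = 0, lc = 0; i < n; i++, lc += 4) {
        offsets[i] = 0;
        if (src[i].branch_to) {
            int target = lookup(tab, nsyms, src[i].branch_to);
            if (target < 0) return -1;
            offsets[i] = (target - (lc + 8)) / 4;   /* PC reads as $+8 */
        }
    }
    return 0;
}
```

For the forward-reference example on the slide (bne foo at byte 4, foo at byte 12) the offset comes out as 0; for a branch back to the immediately preceding line it is -3.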

Conditional Assembly

● At assembly time, you can test a condition (usually based on textual or constant equality) and keep the assembler from seeing a block of code if the condition fails.
● Like an if-statement at assembly time. It affects what code is actually placed into your object file.
● One set of source code can generate machine code for slightly different platforms.
● C and C++ have this feature too.

Example Crossware Code

        ifeq __NoRom
        ...
        dcdu HardFault+1
        ….
        elsec
        ….
        dcdu HardFault+27
        endc

Other Conditional Directives

● ifeq <expr> checks whether <expr> is zero
● ifge <expr> checks whether it’s >= 0
● iflt <expr> : < 0
● ifc <string1>,<string2> checks whether the two strings are equal to each other

Macros (textbook p 73)

● Macros allow a programmer to assign a mnemonic name to a bunch of assembler lines.
● When the mnemonic is then used, associated assembler lines are “copy pasted” into the source code (from the viewpoint of the assembler: actually, your actual source code is unchanged).
● Macros can have parameters that are substituted in the copy-paste process.
● Macros can function like assembly time methods.
● Not all assemblers support macros. Certain HLLs, notably C and C++, also have macros.

Macro in Crossware

● Macros and conditional compilation in Crossware’s ARM assembler and their 8051 assembler appear similar. The 8051 is documented...(see course website for link)
● foobar macr
        ….. some lines of assembly (body of macro)…
        endm
● To invoke: foobar R0, hello, 35 ← comma sep. args
● First param is \0 in body of macro (here, it is R0)
● Also, \1, \2, …
● Labels in the macro body should be \.0 to \.9 (On each invocation of the macro, a different label will be used)

Crossware Macro

silly   macr
        ifc \0,R0           ← no space
        mov \1, \0
        elsec
        cmp R7, #\0
        beq \.0
        mov \1, #\0
\.0     add R0, R0, R0
        endc
        endm

● Silly R0, R5 expands to
        mov R5,R0
● Silly 18, R5 expands to
        cmp R7, #18
        beq temp000
        mov R5, #18
temp000 add R0,R0,R0

Use of Macros

● Skilled assembler programmers (there are a few left…) often develop a library of macros that generate code for a variety of fiddly tasks.
● The include assembler directive requests that the assembler read a named file (perhaps with lots of juicy macro definitions) and act as if the contents had been pasted into this source code file.
● C language uses a similar mechanism. Java has a higher-level import idea.

Repetition at Assembly Time

● Sometimes, you want the assembler to process the same block of code a bunch of times
● But you don't want to type it yourself
● Some assemblers (dunno about Crossware's) allow you to put a REPT n at the start of a block of lines, and ENDREPT at the end.
● Like an assembly-time FOR loop running n times.
        REPT 5
        ADD R1, R2, R3      ← assembler sees this 5 times
        ENDREPT

“Macro Language”

● Together, macros, conditional assembly and maybe repetition essentially form a little programming language that runs at assembly time.
● If unlimited repetition (or recursive macros) are allowed, the macro language can be “Turing Complete” - wait till you finish CS2333.
● In the 1990s, Shaw/McNally challenged me to implement numerical integration in the TASM macro language.
● Usefulness: compute a table of values to use in the “real” program.
● N.B. The macro language for C is not Turing Complete.

Calling Conventions and the Stack
CS2253
Owen Kaser, UNBSJ

Overview

● Stacks and the Load/Store Multiple Instructions
● Subroutines
● ARM Application Procedure Call Standard
  – Code Linkage Mechanism
  – Parameter Passing
  – Caller- and Callee-Save Registers
● See Chapter 13 of textbook

Stack in Memory

● Recall from CS1083 (or CS2383) that one can use an array to store a stack's data. Also need an integer array index, called TOP.
● Push(value) → Data[++TOP] = value;
● Pop → return Data[TOP--];
● TOP was initialized to -1 and always points to the value to be popped. Stack grows up.
● ARM folk call this a “full ascending” stack

Full-Descending Stack

● In low-level programming, a Full-Descending stack is more common. ARM ABI requires it.
● TOP is initialized to max_valid_index + 1
● Push(value) → Data[--TOP] = value
● Pop → return Data[TOP++]
● We decrement before on push and increment after on pop.
● DB and IA.

Empty Stacks

● To ARM, an empty stack is where the top-of-stack pointer indicates where the next push will go.
● Push(value) → Data[TOP--] = value
● Pop → return Data[++TOP]
● Decrement After (DA) and Increment Before (IB) for an “empty descending” stack.
● Empty ascending stacks are also possible.
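The full-descending discipline (decrement before on push, increment after on pop) can be written out in C as an array-based stack. This is a minimal sketch: the names `fd_push`, `fd_pop` and `fd_depth` and the capacity of 250 words are choices made here, not anything the ARM ABI prescribes.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 250

static uint32_t fd_data[STACK_WORDS];
static int fd_top = STACK_WORDS;      /* max_valid_index + 1: nothing pushed yet */

/* Push: decrement before storing (the "DB" in STMDB). */
void fd_push(uint32_t value)
{
    assert(fd_top > 0);               /* refuse to run off the low end */
    fd_data[--fd_top] = value;
}

/* Pop: read, then increment after (the "IA" in LDMIA). */
uint32_t fd_pop(void)
{
    assert(fd_top < STACK_WORDS);     /* refuse to pop an empty stack */
    return fd_data[fd_top++];
}

/* Number of values currently on the stack. */
int fd_depth(void)
{
    return STACK_WORDS - fd_top;
}
```

Between operations fd_top always indexes the most recently pushed value ("full"), and it moves toward index 0 as the stack grows ("descending").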

ARM Assembly Push

● Use a form of the store multiple (STM) instruction for push, and a form of load multiple (LDM) for pop.
● Top-of-Stack is usually the R13 register (alias SP)
● Push(value in R5) → STMDB SP!, {R5}
● Pop (from stack to R5) → LDMIA SP!, {R5}
● This is for a Full-Descending stack, so you can use STMFD and LDMFD if desired (maybe).
● Crossware does not support PUSH and POP (textbook 13.2.2)

An Un-RISCy Quirk

● We'll soon see that programmers often want to push (or pop) a bunch of registers to the stack.
● RISC approach: need 6 instructions for 6 pushes
● ARM: have a single complex instruction STM to push several registers. Requires multiple clock cycles.
● STMDB SP!, {R12, R3-R5, R7, R8}
● The lowest-numbered register ends up at the lowest address, whatever order the list is written in, and the ! ensures that SP gets changed.
● LDMIA SP!, {R8, R7, R12, R3-R5} will restore them.

LDM/STM encoding

● P=1 means Before
● U=1 means Upward
● W means a ! was used
● L=1 means LDM instead of STM
● S=1 means some weird behaviour that depends on processor mode; see technical docs
● The register list is a bit vector; bit k means Rk is included in the set of registers loaded or stored.

Full Descending Stack

StackBottom SPACE 1000
StackTop                    ; label to mark address beyond top
        …..
        LDR SP, =StackTop   ; initialized
● Now we can push and pop, up to a limit of 250 32-bit values.
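The P, U, S, W, L bits and the register-list bit vector from the LDM/STM encoding slide can be packed in C as follows. This is a sketch under assumptions: `reg_bit` and `encode_block` are names invented here, bits 27..25 are taken to be 100 as in the ARM block-data-transfer format, and the condition is fixed at "always" (1110).

```c
#include <assert.h>
#include <stdint.h>

/* One bit of the register list: bit k set means Rk takes part. */
uint16_t reg_bit(unsigned k)
{
    assert(k < 16);
    return (uint16_t)(1u << k);
}

/* Pack an LDM/STM instruction; each of p, u, s, w, l is 0 or 1. */
uint32_t encode_block(unsigned p, unsigned u, unsigned s, unsigned w,
                      unsigned l, unsigned rn, uint16_t reglist)
{
    assert(rn < 16 && reglist != 0);
    return (0xEu << 28) | (0x4u << 25)
         | ((uint32_t)(p != 0) << 24)   /* P: before (1) or after (0) */
         | ((uint32_t)(u != 0) << 23)   /* U: upward (1) or downward (0) */
         | ((uint32_t)(s != 0) << 22)   /* S: mode-dependent behaviour */
         | ((uint32_t)(w != 0) << 21)   /* W: "!" writeback */
         | ((uint32_t)(l != 0) << 20)   /* L: LDM (1) or STM (0) */
         | ((uint32_t)rn << 16)
         | reglist;
}
```

Under these assumptions STMDB SP!, {R5} packs to 0xE92D0020 and LDMIA SP!, {R5} to 0xE8BD0020. Because the register list is a bit vector, the order in which registers are written in the source cannot be recorded in the machine code at all.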

Why Push and Pop?

● The stack is convenient for saving registers (push them to save; later pop them to restore).
● Let's suppose you have some valuable info in R3, R4. You are about to enter into some code that trashes R3 and R4. And after you exit from that code, you have some use for the old R3 and R4.
        STMDB SP!,{R3,R4}   ; save R3 and R4
        … code that trashes R3 and R4...
        LDMIA SP!,{R3,R4}   ; restore saved registers
        … code that uses R3 and R4...

Subroutines

● A subroutine is essentially a method (that doesn't belong to an object).
● You call a subroutine and return from it.
● A reentrant subroutine can be paused while running, and while paused, another copy of it can start running. When that copy has finished, the paused subroutine is allowed to continue...and no problems arise.
● Your most familiar reentrant situation is with recursion. (Later: we will see “interrupts”)
● Naive coding of subroutines leads to non-reentrancy. Fancy coding with stack frames will give reentrancy.

Getting Into/ Out of Subroutines

● If you have some code you wish to call, use a BL (branch-and-link) instruction.
● Like B, it changes the program counter so you jump somewhere (the first instruction of the subroutine.)
● But R14 (alias LR, the link register) is automatically set to the return address of PC-4 (the address of the instruction immediately after the BL instruction)
● At the end of the subroutine, arrange for the return address to go into PC.

Example Subroutine

; this subroutine multiplies the value in r0
; by 10 and returns result in r1. Trashes r2.
times10 mov r2, r0, lsl #3      ; r2 = 8*r0
        add r1, r2, r0, lsl #1  ; r1 = r2 + 2*r0
        mov pc, lr              ; return
        …...
        mov r0, #5
        bl  times10             ; invoke subroutine
        ….

Passing Parameters

● Our example passed an input parameter by using a register.
● And it used a register to pass back a return value.
● This is a common approach.
● Another alternative is to pass parameters on the stack (caller pushes them).
● The callee then accesses them in memory.
● When subroutine is finished, either the caller or callee must pop off the parameters so stack doesn't overflow.

Subroutines Calling Subroutines

● Commonly, one method calls another.
● Say main() calls func1(), which then calls func2(). Don't lose the original return address!

Saving/Restoring Registers: 1

● Suppose you are going to call a routine that's documented to trash R4 and R5. Currently, you have something valuable in R4 that you will need later. But the value in R5 doesn't matter anymore to you.
● Caller-save scheme:
        push just the value in R4
        call the routine, using BL
        pop into R4

Saving/Restoring Registers: 2

● Suppose you're writing a subroutine. You've been told you're not allowed to trash R4, but you are allowed to trash R5.
● You really need to use both R4 and R5
● Callee-save scheme:
mysub   STMDB SP!,{R4}      ; save R4
        ...code trashing R4 and R5
        LDMIA SP!,{R4}      ; restore R4
        MOV PC, LR          ; return

Interoperability

● I want to be able to call your subroutines, and you want to be able to call mine.
● So I need to know how you expect parameters passed in, and how return values should be handled.
● I'd like to know whether I can trust you not to trash my registers' values.
● I'd like to know if there are any registers that I can trash, without having to save/restore them.
● We need some rules...

AAPCS (Textbook 13.5)

● The ARM Application Procedure Call Standard is our binding contract. If we both follow it, our code will interoperate smoothly.
● It's an example of a calling convention.
● 3 parts to the contract
  – obligations of caller to set things up for callee
  – obligations of callee not to (permanently) trash parts of the state of the caller
  – rights of the callee to trash other parts of the caller's state.

AAPCS: Parameter Passing

● R0 to R3 have the parameter values. Caller places them there.
● Callee places return value(s) in R0 and R1
● Callee is otherwise free to trash R0 to R3 (so they are essentially “caller-save registers”).
● Subroutines with more than 4 parameters will have the caller push the extra parameter values on the stack. [not sure who is responsible for their cleanup: guess it is the caller, who made the mess.]

AAPCS: Callee-save

● The callee must preserve R4-R11, SP (so they are “callee-save registers”)
● R4 to R11 typically contain local variables of the caller. And the caller expects the SP to come back unchanged.
● It is very normal to do all the callee-save pushing as a single STMDB instruction at the very start of the subroutine. This is said to create a stack frame.
● And the corresponding popping is a single LDMIA instruction that is at the end of the subroutine.

AAPCS: Status flags

● You cannot expect the status flags to be preserved during a call.
● So essentially, they would be caller-save.
● Except that it's rare to need an earlier-computed status flag after the subroutine returns.

AAPCS: Link register

● The caller will have given us an R14 value that tells us where to return to when finished.
● Any BL that we do will destroy it. Very bad.
  – Case 1: we are a “leaf procedure” (ie, make no calls). Just don't use R14 for anything. Finish with MOV PC, R14
  – Case 2: we are a non-leaf procedure (and can potentially make a call). Push R14 with the callee-save registers at the start of the subroutine. And pop it (into PC) with the LDMIA instruction when finishing.
  – This works because of the order in which LDM/STM stores regs.

AAPCS: R12

● Textbook indicates that R12 is somehow used by the linker at the point when the caller invokes the callee. Mysterious.
● You are allowed to trash it, but you must expect the calling process to potentially trash it.
● I.e., R12 is a caller-save register.

Warning

● Deviate from AAPCS at your own risk.
● First, your code won't otherwise interoperate.
● Second, DIY approaches to these things tend to fail in horrible and utterly puzzling ways, especially if recursion is involved. I don't want to be responsible for your soul-destroying all-night debugging session that makes you switch to a BBA degree. (The world needs you in CS: follow the AAPCS...)

Let's Code the Recursive Fibonacci

● int fib(int n) {
      int temp, temp2;
      if (n < 2) return n;
      else {
          temp = fib(n-1);
          temp2 = fib(n-2);
          return temp+temp2;
      }
  }

ARM Exception Handling
CS2253
Owen Kaser, UNBSJ

Overview

● Warning: hardest parts of CS2253.
● Back to Chapter 1: Processor Modes & Vector Table
● Concept of Exceptions
  – Interrupt Handlers
  – Priority Levels
● Software Interrupts
● Memory-Mapped Input/Output
● See Textbook Chapter 14, 16

Exceptions

● Sometimes, the normal flow of control needs to be unexpectedly modified
  – A character is received on the keyboard
  – An access is made to a memory location with no memory/device there
  – The processor is asked to execute a bit pattern that is not a valid machine code
● Two main cases: interrupts and errors

Interrupts

● An interrupt can occur because a hardware device wants service...now.
● Device: “I have just received a character from the keyboard, and my buffer is only 1 character deep. Please stop whatever you are doing and remove/process this character ASAP, so that I will be able to accept the next character the keyboard sends”.
● CPU: “ok, I see your interrupt request. I am interrupting my normal execution to switch to handler code for you. When I finish, I will return to my normal execution, where I left off.”

Why Interrupts?

● There can be a lot of asynchronous things happening in a computer system. Having one program keeping track of them would be hard.
● (Eg, every loop that ran more than a few microseconds would have to contain code to check for input/output)
● System running one dedicated program can use this approach. Maybe it even waits, looping, while I/O happens.
● This is called polling I/O. It is generally viewed as unwieldy.
● Better to just have the I/O device interrupt.

IRQ and FIQ interrupts

● The ARM ISA defines regular (IRQ) interrupts and fast (FIQ) interrupts.
● Stay tuned.

Software Interrupts

● Some ISAs, including ARMv4, have a special SWI instruction that, when executed, causes the system to act like a hardware device requested an interrupt.
● A hardware interrupt is like an unscheduled subroutine call that also puts the processor into a more privileged mode. Handler code is trusted and part of the operating system.
● So an SWI instruction is often used to invoke an operating system service subroutine.
● Book refers to the SWI instruction under its new name, SVC, but Crossware still uses SWI.

Error Exceptions

● Undefined instruction
  – Can be intentional (emulate a “missing” instruction)
● Prefetch abort
  – An attempt to fetch instruction fails (eg, PC is not a valid memory location)
● Data abort
  – A LD or ST with an illegal address
  – A store to a read-only address
● Sometimes, the response should be to die gracefully. But other times, we may be able to recover and continue.

Overall Approach

● For an exception, we need to
  – save the current state (including CPSR)
  – Reset the PC to the handler code [ & change mode]
  – Execute the handler
  – Restore the saved state, including the PC & mode
● State saving and PC resetting are done by hardware. Handler and restoring done by software.

Modes Processor Modes (Book Fig 2.1)

● See textbook Sections 2.3.1, 2.3.2.

● Normal code executes in User mode. In processing exceptions, modes are System, Undef, Abort, IRQ, FIQ, Supervisor.

● In some of these modes, some of the registers are banked out. The User version of R11 is hidden from use, replaced by another R11 when in FIQ mode. (So the User version of R11 is safe from modifications.)

Registers in Different Modes Recognizing Your Mood Mode

● The CPSR stores more than your status flags. A 5-bit field M4:M0 stores a flag indicating mode (Table 2.1) – eg 10000 for User, 11011 for Undefined, etc.

● I bit enables or disables IRQ interrupts

● F bit enables or disables FIQ (fast) interrupts

● T is status: are you in Thumb mode?
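The CPSR fields just listed can be picked apart with shifts and masks. The sketch below assumes the usual ARM layout (mode in bits 4..0, T in bit 5, F in bit 6, I in bit 7, where a 1 in I or F means that kind of interrupt is disabled). The function names are invented here, and `cpsr_with_mode` only mirrors the arithmetic that code around an MSR would do; it touches no real register.

```c
#include <assert.h>
#include <stdint.h>

/* Mode field encodings (M4:M0). */
enum {
    MODE_USER = 0x10, MODE_FIQ = 0x11, MODE_IRQ = 0x12, MODE_SVC = 0x13,
    MODE_ABORT = 0x17, MODE_UNDEF = 0x1B, MODE_SYSTEM = 0x1F
};

unsigned cpsr_mode(uint32_t cpsr)    { return cpsr & 0x1Fu; }
int      thumb_state(uint32_t cpsr)  { return (int)((cpsr >> 5) & 1u); }
int      fiq_disabled(uint32_t cpsr) { return (int)((cpsr >> 6) & 1u); }
int      irq_disabled(uint32_t cpsr) { return (int)((cpsr >> 7) & 1u); }

/* Value to write back in order to switch mode, leaving other bits alone. */
uint32_t cpsr_with_mode(uint32_t cpsr, unsigned mode)
{
    assert(mode >= 0x10 && mode <= 0x1F);   /* valid modes have M4 = 1 */
    return (cpsr & ~(uint32_t)0x1F) | mode;
}
```

For instance, the value 0xD3 decodes as Supervisor mode with both IRQ and FIQ disabled and Thumb off.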

Details Invoking Exception

● See textbook 14.4.
● First, CPSR copied to SPSR_<mode>
● Adjust CPSR (mode bits, disable IRQ, maybe disable FIQ)
● Store return address to LR_<mode>
● Set PC to start of relevant handler

Finding the Right Handler

● For the different kinds of exceptions, there are different handlers. When an exception occurs, the hardware determines the source of the exception as a 3-bit number, which it uses to index the vector table (which starts in memory at address 0).

Textbook Figure 14.3 Returning After Exception

● When the handler has finished its task, it returns to the caller (in software)

● The mode needs to be put back to its pre-interrupt value. And the PC needs to be put back to the correct instruction.
  – Either to the instruction that had the exception (and did not successfully finish) or to the next instruction. Case depends on the kind of exception
● SUBS PC, LR, #4 or SUBS PC, LR, #8 ← magic CPSR restore when PC is the destination and the S flag set
● Or on entry to handler
  – adjust LR (eg subtract 4)
  – STMDB sp!,{some regs, lr}
● then use a LDMIA sp!, {some regs, pc}^ to return.

● ^ means to restore CPSR also.

Multiple Stacks

● Interrupt code typically uses stacks. And there is a separate R13 for each mode (except one). So there is a separate stack per mode...and at machine startup, it needs to be initialized.
● Initialization via a MSR (move into status register) instruction to change mode.
● Then store a value to (that) SP.
● Then use a MSR to put mode back

Mrs. MRS

● MSR moves to a status register.
  – Status registers are CPSR, SPSR
  – Underscores after (eg CPSR_cf) indicate which sub-parts of the status register are affected. Used in book code but not described… _cxsf is all of it?
● MRS moves a status register into a regular register.

Implementing ADDSHIFT

● See Example 14.1 for the hairiest program of CS2253.

Priorities

● In order of decreasing priority, we have
  – Reset
  – Data abort
  – FIQ
  – IRQ
  – Prefetch abort
  – SVC and Undefined Instruction
● A higher priority exception can interrupt the handling of a lower priority exception, but usually not the other way.

Vectoring IRQ Interrupts Priorities in IRQ Textbook Figure 14.5

● Even within a given exception (eg IRQ), some hardware units (eg disk) are more urgent than others (eg keyboard).

● To prioritize, could OR together all interrupt request inputs. Then software can check each possible device to see who’s knocking...starting with the most urgent.

● Or a special priority device, a VIC, can take care of this.
  – Devices' IRQ lines go to VIC
  – Only VIC actually interrupts CPU
  – CPU can ask VIC for the handler address of the highest priority active interrupt request. [talking to devices: stay tuned!]

Software Interrupts

● Software can generate an exception. Use SWI to request an operating-system service.
● SWI handler has to use the value in R14 to find the actual instruction, in order to extract the “SVC number” field and thus know which OS service was requested.
● Assembler example: SWI 234

Talking to Devices (Ch 16)

● Not dealing with co-processors in CS2253.
● Instead, we are dealing with attached devices such as
  – UART (“serial port”)
  – Timer
  – Analogue to Digital (A-D) and Digital to Analogue converters
  – Disk controllers
  – General Purpose I/O (GPIO) connections that can control electronic devices (LEDs, motors, ...)
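The step in which the SWI handler uses R14 to find the instruction and pull out its number can be sketched in C. Assumptions, stated plainly: the caller was in ARM (not Thumb) state, so the SWI sits one word before the return address; bits 27..24 of an SWI are 1111 and the number is the low 24 bits. The function names are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Address of the SWI instruction, given the handler's R14 (ARM state). */
uint32_t swi_address(uint32_t lr)
{
    assert(lr >= 4);
    return lr - 4;
}

/* Does this machine word look like an SWI? (bits 27..24 all ones) */
int is_swi(uint32_t instr)
{
    return ((instr >> 24) & 0xFu) == 0xFu;
}

/* The 24-bit number the programmer wrote after SWI. */
uint32_t swi_number(uint32_t instr)
{
    assert(is_swi(instr));
    return instr & 0x00FFFFFFu;
}
```

With the "always" condition, SWI 234 assembles to 0xEF0000EA under these assumptions, and swi_number recovers 234 from it.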

Special I/O Instructions

● Some ISAs (not ARM) have special instructions to access devices
● Intel x86: IN and OUT instructions.
● Devices are assigned 1+ numeric “port addresses”.
● CPU does an OUT instruction to the port address of the UART to give it a command
● CPU does an IN instruction to the port address to get a byte of data from the UART, or check its status.
● Frequently, a device has several (usually consecutive) port addresses:
  – 1+ status port, 1+ control port, 1+ data output port, 1+ data input port.

Memory-Mapped I/O

● Alternative: memory-mapped I/O, where certain “memory” addresses have no RAM.
● Instead, they are assigned to devices. No distinction between “port addresses” and “memory addresses”
● Now, ordinary LD and ST instructions (to the right “memory” addresses) can talk to devices.
● No IN or OUT instructions: RISC-ish.

Book Example SoC's Memory Map

● LPC2104 System-on-Chip has an ARM processor core and a bunch of memory-mapped devices (peripherals)

● Some are internally connected via an “AHB” bus and some via a “VPB” bus.

● All VPB bus peripherals are in one range of addresses (0xE0000000 to 0xEFFFFFFF)

● UART0 device occupies 0xE000C000 to 0xE000C01C

● Every 4 addresses, we have 8 bits of data. Use LDRB or STRB to access, avoid endian-ness issue.

Zoom Into 0xE000C000 area

UARTs are Fiddly

● Serial communication is quite hairy. Lots of communications parameters to be set. You're lucky this era is largely past us.

● After setup, sending characters is pretty easy.

● Bit 5 of the Line Status Register (memory mapped to 0xE000C014) tells us if the transmitter buffer can accept another character now. (0 means “yes”)

● Transmit Holding Register (0xE000C000) is where we can put the next character.

Polling I/O Example (p 347, mod)

; ASCII code to send is in R0
        LDR  R5, =0xE000C000
wait    LDRB R6, [R5, #0x14]    ; Line Status Reg
        CMP  R6, #0x20          ; Risky check of Bit 5
        BEQ  wait               ; spin until ready
        ; if we get here, bit 5 = 0
        STRB R0, [R5]           ; write to transmitter buff.

Polling vs Interrupt Driven I/O

● Note how the “wait” loop repeatedly asks (polls) the device to see if it is ready.
● If not, it just tries again...and again...and....
● No other work in the system can be done...tight polling loops waste system resources.
● Better to flip some bits in 0xE000C004 and get the transmitter to give us an interrupt when its buffer gets some space. In the meantime, we can be doing other things (eg, getting input, doing calculations).
● This “Interrupt-driven I/O” is generally better, though more complicated.
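Why the CMP in the polling loop is a “risky check of Bit 5” can be seen by comparing a whole-byte comparison with a masked test. The sketch below works on plain byte values, not on a real UART, and makes no claim about which polarity of bit 5 means "ready"; the function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Value (0 or 1) of bit k of a status byte. */
unsigned bit_of(uint8_t value, unsigned k)
{
    assert(k < 8);
    return (value >> k) & 1u;
}

/* Whole-byte comparison, as in CMP R6, #0x20:
   true only when bit 5 is the ONLY bit that is set. */
int lsr_equals_0x20(uint8_t lsr)
{
    return lsr == 0x20;
}

/* Masked test, in the spirit of TST R6, #0x20:
   looks at bit 5 and ignores every other status bit. */
int lsr_bit5_set(uint8_t lsr)
{
    return bit_of(lsr, 5) == 1;
}
```

A status byte of 0x60 has bit 5 set, yet the whole-byte comparison says "not equal", because bit 6 is also set. A loop built on the comparison can therefore draw the wrong conclusion whenever any other status bit changes.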

C-Style Strings
CS2253
Owen Kaser, UNBSJ

Strings

● In C and some other low-level languages, strings are just consecutive memory locations that contain characters. A special “null character” (ASCII code 0) terminates the string.
● Common string-processing library routines are a good source of assembly-language examples.

Making a Constant String

● (Review) Use DCB and don't forget the null character terminator
● mystring dcb “hello”,0

A String Local Variable

● Suppose you know you need a string local variable. If you know the maximum length you could possibly need (say 50 characters), proceed as follows....
mySubroutine STMFD SP!, {some regs, LR}
        SUB   SP, SP, #52       ; maintain SP alignment
        MOV   R0, #0            ; null character
        STRB  R0, [SP]          ; terminate string (show picture)
        … use space from SP to SP+51 for your string..
        ADD   SP, SP, #52       ; pop off space used by string
        LDMFD SP!, {some regs, PC}

Stack Smashing

● Q: What if someone is allowed to put a 56-byte string into your 52 byte area?
● A: You affect the things in the memory addresses above your string.
● The last thing pushed by the STMFD was the return address. So you have a wrong return address.
● A cracker can write some nasty machine code program as the 56-byte “string” and arrange for you to return to her program.
● Moral: String locals need to be very carefully checked to see that they are not too long.
● Some modern CPUs will mark the stack region of memory as “nonexecutable” to help. You can still be forced to return to an arbitrary location in the existing program, which may be good enough for the cracker.

Returning a String

● Suppose your subroutine is supposed to return a string.
● You can just return the memory address of somewhere in memory that holds the characters of your string. (In C terminology, you return a pointer to your characters.)
● But that somewhere needs to be “safe” - not subject to arbitrary destruction.
● Any stack location below the top of the stack is not safe.

Bad Scenario (with picture 1)

● main subroutine calls foo

● foo has a local string variable, v, that it puts some lovely string into.

● foo returns the address of v to main

● main turns around and calls bar

● bar returns. main tries to use the lovely string. Unhappiness results.

Bad Scenario, picture 2

● Because the string address sent by 'foo' to main was in the danger zone, 'bar' trashed it. Not bar's fault.
● Solution: Never return the address of a local variable.

Non-Reentrant Solution

● If a subroutine S needs to return a string (whose maximum length is known), then it can put the string in a “buffer” memory location set aside just for S. And it can return the address of that buffer to its caller.
● S's buffer is safe enough...except from itself. This approach means S won't be reentrant – S cannot be recursive.
● And callers to S should copy out the answer, in case anyone they invoke also calls S.
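The dedicated-buffer approach, and the reason callers should copy the answer out, can be demonstrated in C with a static buffer. This is only an illustration: the function `shout` and its 32-byte buffer are invented here to play the role of S and S_buffer.

```c
#include <assert.h>
#include <string.h>

#define S_BUF_LEN 32

/* Plays the role of S: builds its result in one buffer set aside just
   for this routine, and returns the address of that buffer. */
const char *shout(const char *word)
{
    static char s_buffer[S_BUF_LEN];
    size_t n = strlen(word);
    assert(n + 2 <= S_BUF_LEN);     /* room for the word, '!' and the null */
    memcpy(s_buffer, word, n);
    s_buffer[n] = '!';
    s_buffer[n + 1] = '\0';
    return s_buffer;
}
```

Two calls return the same address, so a caller who keeps the pointer from the first call will find the second call's text there. A caller who copied the first answer out (for example with strcpy into its own array) still has it.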

Example

S_buffer DCB 0
         SPACE 31               ; total length 32
S        STMFD SP!,{...,LR}
         … put some string into S_buffer...
         LDR R0, =S_buffer      ; return value in R0
         LDMFD SP!,{...,PC}     ; return to caller

Length of a String (in R0)

strlen  mov   R1, #0            ; length counter
loop    ldrsb R2, [R0], #1      ; get current character
        cmp   R2, #0            ; null terminator? (the load does not set flags)
        addne R1, R1, #1
        bne   loop
        mov   R0, R1            ; return value in R0
        mov   PC, LR            ; return
● Since this is a leaf method, we didn't need STM and LDM

Reverse (buffer version, untested)

    rev_buffer SPACE 32
    reverse  mov R1,R0               ; R1 is caller save
             stmfd SP!, {R1,LR}      ; R1 is stored at [SP], LR at [SP,#4]
             bl strlen               ; length in R0
             mov R1,#0
             ldr R2,=rev_buffer
             strb R1, [R2,R0,LSL #0] ; mark end
             sub R0, R0, #1
             ldr R1, [SP]            ; recover start of input
    loop     ldrsb R3, [R1],#1       ; the copying loop
             cmp R3, #0              ; reached the terminating 0?
             beq done
             strb R3, [R2, R0, LSL #0]
             sub R0, R0, #1
             b loop
    done     ldmfd SP!, {R1, LR}
             ldr R0, =rev_buffer     ; return value
             mov PC, LR

Or, Use a Stack

● Can push a bunch of characters to stack from input. (And count them).

● Pop them off, one at a time, and append to buffer

● Then return address of buffer.
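The stack-based idea can be sketched in C as follows. This is only an illustration under stated assumptions: a local array stands in for the hardware stack, and the names `reverse_with_stack`, `rev_out` and `REV_MAX` are hypothetical. The sketch also truncates input that would not fit in the 32-byte buffer, which the bullets above do not discuss.

```c
#include <assert.h>
#include <stddef.h>

#define REV_MAX 32              /* assumed size, matching a 32-byte buffer */

static char rev_out[REV_MAX];   /* buffer whose address is returned */

/* Reverses the 0-terminated string `in` by pushing its characters onto
   an explicit stack and popping them into rev_out. Only the first
   REV_MAX - 1 characters of a longer input are used. */
const char *reverse_with_stack(const char *in)
{
    char stack[REV_MAX - 1];    /* stands in for the hardware stack */
    size_t top = 0;             /* number of characters pushed so far */
    size_t n = 0;               /* number of characters appended to rev_out */

    assert(in != NULL);

    /* Push characters from the input, counting them as we go. */
    while (*in != '\0' && top < sizeof stack) {
        stack[top++] = *in++;
    }

    /* Pop them off, one at a time, and append each to the buffer. */
    while (top > 0) {
        rev_out[n++] = stack[--top];
    }
    rev_out[n] = '\0';          /* mark end */

    return rev_out;             /* address of the buffer */
}
```

Because the last character pushed is the first one popped, the buffer fills in reverse order without any index arithmetic on the input length.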

Alternative Approach

● We can make the caller responsible for finding space for us to store the returned string.

● The address of the space for the returned string (probably in the caller's activation record) is passed as a parameter.

● This is a little better than the buffer approach.

Reverse (param 2 has address)

    reverse  mov R2,R0               ; R2 is caller save
             stmfd SP!, {R1,R2,LR}   ; R1 (output address) at [SP], R2 (input) at [SP,#4]
             bl strlen               ; length in R0; strlen overwrites R1 and R2
             mov R2,#0
             ldr R1, [SP]            ; recover address of caller's space (param 2)
             strb R2, [R1,R0,LSL #0] ; mark end
             sub R0, R0, #1
             ldr R2, [SP,#4]         ; recover start of input
    loop     ldrsb R3, [R2],#1       ; the copying loop
             cmp R3, #0              ; reached the terminating 0?
             beq done
             strb R3, [R1, R0, LSL #0]
             sub R0, R0, #1
             b loop
    done     ldmfd SP!, {R1, R2, PC} ; no return value

Making It Robust

● When the address of an output buffer is passed in, you should usually pass along another parameter to indicate how long the buffer is.

● And the string routine should be coded to avoid overflowing the buffer.

● Without the “how long” parameter, the string routine would have no way of knowing when overflow might occur.

● Early design of the C string library didn't really seem to appreciate this enough. Later additions did, but by then, programmers had developed sloppy habits.
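A robust version of the caller-supplied-buffer approach might look like the C sketch below. It is an illustration only: the name `reverse_into`, the parameter order, and the choice to report failure with -1 are assumptions, not anything specified in the slides. The sketch also assumes that the input and output areas do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Writes the reverse of the 0-terminated string `src` into `dst`,
   an area of `dst_size` bytes supplied by the caller.
   Returns 0 on success. Returns -1, leaving dst untouched, if the
   reversed string plus its terminating 0 would not fit. */
int reverse_into(const char *src, char *dst, size_t dst_size)
{
    size_t len;
    size_t i;

    assert(src != NULL && dst != NULL);

    len = strlen(src);
    if (len >= dst_size) {      /* need len characters plus one 0 byte */
        return -1;              /* refuse rather than overflow the buffer */
    }

    for (i = 0; i < len; i++) {
        dst[i] = src[len - 1 - i];
    }
    dst[len] = '\0';            /* mark end */
    return 0;
}
```

A caller with `char buf[8];` would write `reverse_into("hello", buf, sizeof buf)`. The same pattern separates `snprintf`, which takes the size of the output buffer, from the older `sprintf`, which does not.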