Dr. Ernesto Gomez : CSE 401 Chapter 5 topics. We will concentrate on sections 5.3-5.8, with some excursions into others if there is time

1. Memory. Memory is something that comes up in the Von Neumann model of computation. A brief recap - our prime model for an algorithm is the Turing Machine - a finite state machine that stores program, data, and work space on a rewriteable tape. The key is being able to move back and forth on the tape, and the ability to write something, go do something else, then come back to what you wrote, in any order and any number of times. Consider our three basic machines: Finite Automaton - no memory, information is stored in the structure but cannot be changed; Pushdown Automaton - add a stack, now information can be stored, but can only be retrieved in a specific order, in this case last-in, first-out, and to get at information deeper on the stack you must delete the information above it; Turing Machine - replace the stack with a tape, now you can read and write in any order, and access a location as many times as you need to. Unlike the simpler automata, the T.M. can store information about what to do when it reads data - it is programmable. Nothing we know is more computationally powerful than the T.M. The Von Neumann machine, which is the model we use in most modern computers, is no more powerful than a T.M. in terms of the problems it can solve - but it is faster and easier to program. The idea is, we have a CPU with a limited amount of local workspace, and we have a separate memory, which can store arbitrarily large amounts of data, and which has the random access property: accessing the information stored at any address in memory takes constant time. Immediately this gives the Von Neumann machine a speed and ease-of-programming advantage over the TM - accessing data on a tape takes linear time, because you need to move the tape to the data location, while random access can be done in a single instruction without needing to loop over the tape to reach the data. TM and Von Neumann are not the only models we could use - neural networks, the lambda calculus, cellular automata, and other models are all computationally equivalent to a TM; each has advantages and disadvantages, and a different data access model. Von Neumann is currently the model that we can most easily build, and that we know how to program.

1.1. The memory hierarchy. We generally have multiple levels of memory; how many levels and what they are depends on physics, technology and commercial reasons. The first computers we built using the Von Neumann model were just that - CPU and memory. But the memory split into multiple layers - a modern machine might have: CPU - cache (may be multiple levels) - RAM - SSD (solid state drive) - hard drive (mechanical) - removable storage (tape, DVD, other).

1.1.1. Physics and engineering. It is not physically possible to build memory that combines speed, arbitrary size, and random access. Two basic reasons: anything we build has finite size, the speed of light is about 3x10^8 meters/second, and we cannot communicate information faster than light (electric current transmits signals slower than light; it is almost c on a straight, highly conductive wire, slower in circuits, semiconductors and other materials). The only way we could get real random access is by arranging all the memory on a sphere of fixed radius r around the CPU. Since our memory elements are finite size, the number of memory elements we get is limited by the area of the sphere, so if we need to add memory, r increases and the memory gets slower. It is not practical to enclose a CPU in a sphere of memory, anyway.
If we arrange memory in a flat grid (how we build circuits), then it is not truly random access, because different memory elements are at different distances from the input/output connection. We can simulate random access (recall - random access means same time, not fast!) by using a clock-driven circuit (like we do in the CPU), and setting the clock rate to allow for the slowest access time of any memory element on the chip - like in a pipeline, where the cycle time is the time needed by the slowest stage. By the way - this implies that if you have multiple memory chips in your computer, the clock time is set by the slowest chip (best performance when all the chips match!). It also implies, all other things being equal, bigger memory capacity → more circuitry → longer wires → slower. How much does this matter? Converting units on the speed of light, we get 30 cm/nanosecond - 10 cm is about 1/3 nanosecond. Since a memory chip is physically much smaller than 10 cm, this may not matter much within an individual chip - placement on the motherboard could matter, however. Recalling that wires don't always take a direct path between any two points in a circuit, you can see that, for a 3 GHz clock (typical midrange workstation), it can take one or two CPU cycles for a signal from the CPU to reach memory.
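A quick back-of-the-envelope check of that claim - a sketch only; the 15 cm trace length and 3 GHz clock are assumed round numbers, not measurements of any particular board:

    #include <stdio.h>

    int main(void) {
        double c_cm_per_ns = 30.0;             /* speed of light: ~30 cm per nanosecond */
        double clock_ghz   = 3.0;              /* assumed 3 GHz CPU clock */
        double cycle_ns    = 1.0 / clock_ghz;  /* one cycle = 1/3 ns */
        double trace_cm    = 15.0;             /* assumed CPU-to-DRAM trace length on the motherboard */

        double one_way_ns  = trace_cm / c_cm_per_ns;  /* time for the signal to reach memory */
        double cycles      = one_way_ns / cycle_ns;   /* the same delay expressed in CPU cycles */

        printf("one-way signal time: %.2f ns = %.1f CPU cycles\n", one_way_ns, cycles);
        return 0;
    }

With these numbers the one-way trip alone is about 1.5 CPU cycles, before any memory chip even starts to respond.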

1.1.2. Technology. We can store bits in many ways - the fastest right now is as the on-off state of a transistor, which is what we do in CPU registers. The disadvantage is, they require power to hold the state. Many transistors packed into a CPU chip use a lot of power and generate a lot of heat, particularly if we want to be able to switch states in under a nanosecond. Also, it is expensive to build the chips; between expense and power requirements we cannot build large amounts of storage this way. We use transistor memory for the CPU working registers, and also for cache on the CPU chip - anywhere else it is too expensive and takes too much power. With current tech, we can put megabytes of memory on the CPU chip, but we need to hold programs and data that are much larger than that. Current main memory is built on capacitors - these store a voltage from 0 (value 0) to between 1 and 2 volts (value 1) in current DDR2 and higher memory. It would be theoretically possible to store more than one bit by using a range of voltages, but storing a single bit is much more error resistant, because we don't care what the exact voltage is, just whether there is a voltage. Voltage in a capacitor depends on a difference between - and + charge on different plates with an insulator between them, and the attractive electric field between the two charges tends to maintain the voltage without power consumption. Capacitors leak due to contact with surrounding material, so they need to be refreshed periodically - circuitry in the memory chip periodically reads and rewrites all bits, which resets the charge for bits set to 1. This only happens about every microsecond, for memory with a clock of 1 to 2 GHz, so it does not greatly reduce performance (see Wikipedia - "memory refresh" for more info). This applies to Dynamic RAM - DRAM. There is Static RAM (SRAM), but it costs more, has more complicated circuitry, and has lower storage density, so it is used mostly in specialized applications. DRAM means that when we turn off the computer we lose the memory contents (because memory refresh takes power). So we use other devices - typically solid state drives (SSD) or spinning drives (classic hard drives with magnetic storage on a spinning disk) - to store data and programs when the power is off.

1.1.3. Memory hierarchy. The "memory hierarchy" is the term used to describe the multiple layers of memory storage required by our machine model and technology. The TM, of course, has only one layer of memory - the tape. Von Neumann architecture explicitly names two layers: 1. working storage inside the CPU (current tech is registers, but a stack has been used in the past); 2. RAM. In modern practice, we have the following:
(1) CPU registers - work space for computation. Fastest, high power draw, complicated circuitry, high cost - generally between 10 and 100 registers, total storage ~8KB, access time << 1 cycle, for cycles faster than 1/2 nanosecond.
(2) Cache - fast storage for the program and data we are processing - on the CPU chip - access time between 1 and ~10 CPU cycles (mostly from more complex addressing and longer wires), same storage technology as registers, but simpler circuitry so lower cost per bit. Usually 1-4 MB per CPU core. Some systems also have off-chip cache; this can be up to several hundred MB, but runs slower than the CPU - latency is between on-chip cache and DRAM.
(3) RAM (SDRAM) - capacitive storage, clock typically between 1 and 2 GHz bandwidth, but latency is ~10 to 60 nanoseconds. Currently over 8 GBytes, up to ~1 Terabyte for large multicore servers. Cost per bit much lower than cache.
(4) SSD - permanent solid state storage - access time ~microseconds, size ~0.5 to 2 Terabytes, cost per bit lower than RAM. These have come down in price and have become much more common in the last ~5 years. Sometimes a smaller SSD is combined in the same machine with a larger, cheaper hard drive. SSD storage is subject to wear - it can only take a certain number of read/write cycles. This is getting better, but it is still not considered fully permanent.
(5) Hard drives - permanent spinning magnetic disks. Lower cost per bit than SSD or memory, size ~1 to 10 TB, latency in milliseconds. Although magnetic storage does not wear out like SSD, the mechanical parts of the drive (read-write heads, spindle) are subject to wear and eventual failure.
(6) Permanent offline storage - ideally, archival media that will last indefinitely. This has been floppy disks, cassette tape, writeable optical media (CD, DVD), and reel-to-reel tape. Hardware for all of these except reel-to-reel is cheap, but cost/bit suffers because media size (again, except reel-to-reel) is under a few GB. Removable USB sticks fit here as well; they can be up to 1 TB, with cost/bit similar to SSD - similar technology.
All modern computer systems have devices 1 to 3, and either 4 or 5, or both. Offline storage (6) is USB for workstations, sometimes reel-to-reel for servers. Cost, speed and time issues mean a lot of offline storage is actually arrays of hard drives (5). In this class we are concerned mostly with categories 1-3 above.

2. CACHE
2.1. Why it works. Cache may be divided into several layers - L1, L2, sometimes L3 on the CPU chip, all running at the CPU clock rate. Higher numeric levels are larger, but have worse latency. (L4 cache is typically not on the CPU chip.) We start by considering L1 cache, which generally has the same access latency as a register - a word in L1 cache can be copied to a register in one cycle, and a write to L1 also happens in one cycle. It would be possible to have instructions that explicitly move data and program from memory into cache - before virtual memory, it was common to divide programs that would not fit in memory into multiple chunks which would call each other from disk; something similar could be done with cache. Instead, just as virtual memory simulates a much larger memory than what is actually available, we are going to use cache to simulate a much faster memory than we actually have. Specifically, we will use cache to simulate a memory as large as the real or virtual memory, such that any location can be accessed in one CPU cycle (this means we are trying to simulate a physically and technically impossible large random-access memory). We will not produce a perfect simulation - there will be failures - but modern cache will simulate a CPU-speed random-access memory at least 95% of the time. To do this simulation, we need the following:
Requirements:
(1) A way of copying memory we are going to need into cache before we actually need it.
(2) A way of finding where a particular memory address has been copied into cache, which takes much less than one cycle.
(3) A way to ensure that any modifications to data copied into cache are updated in memory before we need to read that memory location again.
(4) A way to manage cache contents to make effective use of the space in the cache, which is much smaller than the DRAM - so we need to move data out of cache to allow new data to be copied in, and preferentially move data that we are not going to need in the future.
We would like to do all of the above without user intervention. It would seem that we need to be able to look into the future for requirements 1 and 4, at least, and this would be helpful for requirement 3. We have an alternative which is physically possible:
Locality of Reference (Requirement 1). This is not a law; it is not even a principle, as it is sometimes called - it is more like an observation about our programming habits, and the way we write compilers. The observation is: if we have just accessed address A, then we are much likelier to access A+1 than A+2x10^6 - if we have used a particular address A, we are likely to use values near A, and to need A+1 and other nearby locations soon. This is spatial locality of reference. Also, the more time passes after using A, the less likely we are to use it again soon: this is temporal locality of reference. Summarizing: usually, we will write programs so they have a relatively small working set of data drawn from a small subset of the memory (spatial locality), and that set changes slowly as the program advances (temporal locality). Locality of reference is the reason why cache works. We can recognize programming habits that imply locality of reference, and recognize things that would reduce locality.
Normal program execution means we execute the instruction at the address in the program counter (PC), then increment the PC and execute the next address - at any given time we have executed, and will execute, instructions from a particular range of addresses, and that range changes in an orderly, predictable manner. Compilers tend to allocate static data in the same general area of memory; arrays are always allocated in a single chunk of memory, and objects usually are as well - and if we are using an array or vector, we tend to repeatedly use it in a particular chunk of code, and we frequently access it in an orderly way, such as inside a loop. Function calls on the system stack are allocated in contiguous memory (starting from the highest address and going down from there), and dynamic data gets allocated on a heap that grows upward. Both code and data tend to get allocated in compact areas, leading to a working set of code and data that is divided into a small number of chunks of contiguous memory. This preserves locality of reference. Some programming styles do not work as well with locality of reference: a large number of jump statements in the machine code, with short sequences between the jumps, breaks locality for the instruction stream; large arrays with a random pattern of reading single values mean that subsequent addresses are not necessarily anywhere near each other, which again breaks locality. You can see other cases that would reduce locality - for example, a lot of variables allocated and removed from the heap may lead to fragmentation in the heap (delete an object, then allocate a new object, and the new object probably will not be placed in the hole in memory freed by the deletion). Problems can also happen with, for example, 2D or higher arrays - the two coordinates of a 2D array will be mapped in some order to linear memory addresses, which means that if two adjacent indices in the same row are in contiguous memory, two adjacent indices in the same column will not be (because we fill the memory addresses one row or one column at a time). These programming instances will work less well, and sometimes not at all well, with cache. As it happens, patterns of code that preserve locality of reference are more common than those that do not.
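To make the 2D-array point concrete, here is a small C sketch (the array size is arbitrary): both loops compute the same sum, but the first walks memory in the order C stores it (row by row), while the second jumps a full row of addresses on every access and loses spatial locality.

    #include <stdio.h>

    #define N 1024

    static double a[N][N];   /* C stores this row by row: a[i][0], a[i][1], ... are contiguous */

    int main(void) {
        double sum = 0.0;

        /* Good locality: consecutive accesses are to consecutive addresses,
           so most of them hit in a cache line that is already loaded. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        /* Poor locality: consecutive accesses are N*sizeof(double) bytes apart,
           so nearly every access can touch a different cache line. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        printf("%f\n", sum);
        return 0;
    }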

2.2. How cache works. The first time we load an address into a CPU register, it is typically not in cache - we have a cache miss and we need to load the data into cache (we usually write it to cache and the register in parallel). To use locality of reference, we load the address we requested and a block of nearby addresses into a block of cache usually called a cache line (for historical reasons). On a Linux system, the command "cat /proc/cpuinfo" will display information about the CPU cores; the line labeled "cache_alignment" is the size in bytes of a cache line. On current machines, this is typically 64 or 128 bytes, which is 8 or 16 words (on my Intel 7600 laptop, it is 64). By locality of reference, we are likely to re-use the address we just loaded, and to use the adjacent addresses as well, so our first cache miss gives us a good chance that many later memory requests will already be in cache. Loading a cache line rather than a single item also improves memory performance - in memory with a 1 GHz clock, for example, it could still take 10 nanoseconds before the first 8-byte word arrives. With a memory bus 64 bits wide (1 data line per bit in a word), we would get succeeding words from consecutive addresses every nanosecond, so the entire line would be read into cache in 18 nanoseconds - an improvement of more than a factor of 4 over 8 individual word reads at 10 nanoseconds each (like a pipeline, once things start we get one item per cycle - note that these are memory cycles, not CPU cycles; memory has a different clock). Consider an L1 cache with 256 KBytes capacity - this is 2^18 bytes; 64 is 2^6, so this cache would have 2^12 = 4K cache lines. Assume our system has 42-bit memory addresses (2^64 bytes is of the same order as the total amount of memory in all the computers in the world - considering that 2^40 bytes is a terabyte, four times that is not an unreasonable memory limit; for comparison, my laptop has 39-bit real addresses, 48-bit virtual). Dividing 2^42/2^6 gives 2^36 blocks the size of a cache line in memory - we have 2^12 lines in cache, so there are 2^24 = 16 million blocks in the address space for each line of cache. We are unlikely to have the maximum amount of memory - with a likelier 2^36 bytes = 64 GB of RAM, we have 2^18 = 256K blocks in memory for every line in cache (this is higher than typical for a real system, but in reality we would have two megabytes of total cache in the system we are describing). The actual cache needs to store information in addition to the data - it immediately needs 2 additional pieces of information. Firstly, since a cache line may come from many places in memory, we need a way to track where each cache line came from. This could be a full memory address, but it could also be information that allows us to construct the address - for example, since we have in this case 2^36 64-byte blocks in our 42-bit address space, the address of each block has 36 significant bits, from bit 41 to bit 6. The 6 low bits, positions 5-0, identify a particular byte in a 64-byte block (note that 2^6 = 64), and the block address itself is the address of the first byte in the block, which would be [36 bits]000000 - the 36 high bits followed by 6 zeros. We call the high 36 bits the tag, and by combining the tag with the 6 bits specifying where a byte is in a block we get the location in memory of that byte.
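(As an aside to the /proc/cpuinfo remark above: on Linux with glibc the line size can also be queried from inside a program. A sketch, using the _SC_LEVEL1_DCACHE_LINESIZE sysconf extension, which is Linux-specific and may report 0 if the value is unknown.)

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Linux/glibc extension: size in bytes of an L1 data cache line, 0 if unknown */
        long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        printf("L1 data cache line size: %ld bytes\n", line);
        return 0;
    }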
Each entry in the cache needs the 64-byte line of data, the tag, and a single valid bit. The valid bit is needed because when we turn on the computer, or load a program, the cache is full of random or stale data, and we don't want to mistake it for real data. The valid bit is set when the cache line and tag actually contain real data from a real address; a valid bit of 0 means there is nothing meaningful in the entry. To initialize the cache, all we need to do is set all the valid bits to 0 (a simple circuit can zero all the valid bits).
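In software terms, one cache entry in this example looks roughly like the following C sketch (the names are illustrative; real hardware stores these fields as raw bits, not as a struct):

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BYTES 64      /* 2^6-byte cache line, as in the example */
    #define NUM_LINES  4096    /* 2^12 lines in a 256 KB cache */

    struct cache_entry {
        uint64_t tag;              /* high 36 bits of the 42-bit block address */
        int      valid;            /* 0 = nothing meaningful stored in this entry */
        uint8_t  data[LINE_BYTES]; /* the cached copy of the 64-byte block */
    };

    static struct cache_entry cache[NUM_LINES];

    /* Initializing the cache is just clearing every valid bit. */
    static void cache_init(void) {
        for (int i = 0; i < NUM_LINES; i++)
            cache[i].valid = 0;
    }

    int main(void) {
        cache_init();
        printf("%d lines of %d bytes = %d KB of data\n",
               NUM_LINES, LINE_BYTES, NUM_LINES * LINE_BYTES / 1024);
        return 0;
    }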

2.2.1. Cache addressing - fully associative (Requirement 2). A cache capacity of 256K is not actually all the memory we need for the cache - since every one of our 2^12 cache lines could potentially come from a different place in memory, we need to store the address from which we read a particular cache line together with the data - in addition to the 64 bytes of the line, we need up to 42 added bits to store the memory address. We can reduce this if we realize that, when we read a block, we really don't care where within the block the data we want actually is. If we just treat main memory as if it were divided into 64-byte blocks, then all we need is the high 36 bits of the block address - if we have a 42-bit memory address, and the top 36 bits match our address, then we can find the actual data by just using the bottom 6 bits as the location within the block. We call the high bits of the address the tag; it is what we use to identify where in memory we got the data that is in a cache line. In our case, the address of the cache line contents in memory is:
| tag: 36 bits, 41 to 6 | line offset: 6 bits, 5 to 0 |
This is a fully associative cache: with this scheme we can place any memory address anywhere in cache - the memory location of our particular item is just the tag followed by the bottom 6 bits of the address, and its location within the cache line is just the bottom 6 bits. This matches our intuition of how a cache should work; we copy a block of memory anywhere into the cache and then the CPU can read it in one cycle.

Problem - the memory addresses in the code tell us where the data is within a cache line, but not where the line is in the cache. To find it in the cache, we would have to compare the high 36 bits of the address with every tag, and even in our small cache there are 2^12 = 4096 cache lines we would need to check. This is not going to happen with anything we can build - we could do a fast comparison in parallel for somewhere between 8 and maybe 32 or so items, by fanning out each bit of the address into multiple wires and connecting each wire to a separate comparator - but it is just not practical to fan out a wire by more than about 4; we would need extra circuitry to boost the current, since we can't just divide it any number of times. Performing 4000 simultaneous comparisons requires a fan-out tree with too many layers and too much circuitry to be practical, and even if we were able to build it, it would not be fast enough (because of the many layers, it would probably take multiple cycles).

2.2.2. Cache addressing - direct mapped (Requirement 2). Since we have 2^12 cache lines, we need 12 bits to tell us where a cache line is in the cache. Suppose we just read 12 bits from the memory address and use them as the cache address? Let us take our 42-bit memory address and divide it as follows:
| tag: 24 bits, 41 to 18 | cache index: 12 bits, 17 to 6 | line offset: 6 bits, 5 to 0 |
With this scheme, the memory address completely determines where the cache line will be placed in the cache. Given our memory address, the high 24 bits uniquely determine the location of a 256K area in memory (by setting the bottom 18 bits to 0). The next 12 bits uniquely determine a single location for the line in the cache. The bottom 6 bits locate the data within the cache line. We can reconstruct the memory address by joining the cache index to the left of the bottom 6 bits, and then joining the tag to the left of that to recover the full memory address. Direct mapped addressing is simple and fast - decoding is just connecting wires to the memory address, just like decoding an instruction in MIPS. It takes essentially no time. Re-assembling the address is the same in reverse - it is just connecting wires from the 6-bit line offset, the 12-bit cache index and the 24-bit tag to a register; the whole thing takes picoseconds, all you need is to transfer bits, no added logic is needed. The downside is cache utilization. Specifically, each location in main memory has only one place in the cache where it can be placed; in our example with 64 GB of memory, there are 2^18 = 256K blocks in memory that map to the same cache index. It is quite possible that multiple items in our working set from different memory locations would map to the same cache index, and so we would have to keep copying data we need into a particular cache line only to have to replace it with other data we also need, because there is only one place to put it in the cache even when there is otherwise available space.
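The decoding described above is nothing more than bit selection. A C sketch of the same field extraction for this example layout (the example address is arbitrary):

    #include <stdint.h>
    #include <stdio.h>

    /* 42-bit address split for the direct-mapped example:
       | tag: 24 bits (41..18) | index: 12 bits (17..6) | offset: 6 bits (5..0) | */
    #define OFFSET_BITS 6
    #define INDEX_BITS  12

    int main(void) {
        uint64_t addr   = 0x1A2B3C4D5ULL;                  /* an arbitrary example address */
        uint64_t offset = addr & ((1ULL << OFFSET_BITS) - 1);
        uint64_t index  = (addr >> OFFSET_BITS) & ((1ULL << INDEX_BITS) - 1);
        uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        /* Reassembling the address is the same operation in reverse. */
        uint64_t rebuilt = (tag << (OFFSET_BITS + INDEX_BITS)) | (index << OFFSET_BITS) | offset;

        printf("tag=%llx index=%llx offset=%llx rebuilt=%llx\n",
               (unsigned long long)tag, (unsigned long long)index,
               (unsigned long long)offset, (unsigned long long)rebuilt);
        return 0;
    }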

2.2.3. Cache addressing - set associative (Requirement 2). This is the best compromise solution we have - not as fast as direct mapped, but it can be done in much less than one cycle; it requires more circuitry, but can still be built in practice. The idea is to collect 2, 4, 8 or 16 cache lines into a set. Even a 16-way fan-out can be accomplished in acceptable time, so it is possible to compare a tag value against that many different possibilities in parallel in the time we need. Here is how 4-way set associative would work in our example:
| tag: 26 bits, 41 to 16 | set index: 10 bits, 15 to 6 | line offset: 6 bits, 5 to 0 |

This gives us 2^10 = 1024 sets in the cache, but each set stores 4 cache lines and a corresponding 26-bit tag for each line. With fewer cache locations, we have more blocks in memory mapping to each set, but with 4 entries in each set we do not have as many cases where we have to remove data we actually need from cache. The net result is better utilization - a larger fraction of the cache fills before we have to start pushing data out. Section 5.4 comments on the effect of cache associativity on measures like miss rates. Although different cache structures can affect different test codes differently, and it is hard to separate the effect of any single aspect of cache design from the effects of other differences in hardware, it does appear that set associative caches perform better than direct mapped, and that the improvement due to a larger set shrinks as the set size increases (the improvement of 4-way over 2-way is greater than the advantage of 8-way over 4-way). It seems that in terms of cache addressing, the choice is between direct mapped for simplicity or a low level of set associativity.
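A software sketch of a lookup in this 4-way organization - the hardware compares the four tags of a set in parallel, while the loop below is just the sequential equivalent (structure and function names are illustrative):

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define WAYS        4
    #define SETS        1024    /* 2^10 sets */
    #define OFFSET_BITS 6
    #define INDEX_BITS  10

    struct line { uint64_t tag; int valid; uint8_t data[64]; };

    static struct line cache[SETS][WAYS];

    /* Returns true on a hit and sets *way to the matching entry; false on a miss. */
    static bool lookup(uint64_t addr, int *way) {
        uint64_t index = (addr >> OFFSET_BITS) & ((1ULL << INDEX_BITS) - 1);
        uint64_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);   /* 26-bit tag */

        for (int w = 0; w < WAYS; w++) {                        /* done in parallel in hardware */
            if (cache[index][w].valid && cache[index][w].tag == tag) {
                *way = w;
                return true;
            }
        }
        return false;                                           /* miss: fetch the line from memory */
    }

    int main(void) {
        int way;
        uint64_t addr = 0x2ABCD1234ULL;      /* arbitrary example address */
        printf("address 0x%llx: %s\n", (unsigned long long)addr,
               lookup(addr, &way) ? "hit" : "miss");
        return 0;
    }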

We will now consider how to deal with changes to data in memory and cache, and store operations.

2.2.4. Consistency with memory (Requirement 3). When we do a store operation we are writing a value from a register to memory. We need to consider how we should deal with the cache. We are always going to write the value to cache. When we have a cache, a load operation will always check whether the address we want is in cache, because it is much faster to load from cache, and we have a reasonable expectation (because of locality of reference) that the value will be there. If it is in cache, we do not fetch the value from memory - so if the cache is not updated on a store, we will get the wrong value. Writes to memory take essentially the same time as reads, but they are usually done by hardware in parallel with computation, since we do not need to wait for a value being stored - it is already in a register. We have two options (I agree, the names are confusing, they don't really describe what is happening):
• Write-back: we do not write to memory until we need to remove a cache line to copy something else into the same cache location. Since memory access operations always check the cache first, this does not affect correctness. It means we can read and write repeatedly to the same cache line without accessing memory. We need an extra bit in the cache line, in addition to the tag, called the dirty bit. This is 0 initially, and it is set to 1 if and when we write to that cache line. If the dirty bit is set, then we need to write the entire cache line to memory when we remove it from cache (it would add complexity and space to track individual updates to the cache line, and it would not speed up the writes significantly, because we pipeline them the same as loads). This makes the best use of memory bandwidth. Shared-memory reads by different processors need to check each other's caches - typically each processor core has its own cache, so updates to one cache will not otherwise be visible to the others.
• Write-through: when we write a value to cache, we continue and write it to memory. This reduces the time during which values in cache and memory are inconsistent. Writes to memory are smaller; total write time is longer, because there is a latency delay on each write, and because we may need multiple writes to the same address in a cache line and memory. Memory bandwidth usage per write is reduced, which may help avoid memory bottlenecks in a multiprocessor machine. Shared memory is easier to deal with because the time during which cache and memory are inconsistent is less.
There is no clear advantage to write-through or write-back. If we are not using shared memory on a multiprocessor machine, it arguably makes no difference, since writes are handled by parallel hardware and there is usually no need to delay the CPU for a write. The only issue is - when we need to remove a cache line, we need to verify that any writes to memory from data in that line have completed before replacing it. If we are doing write-back, we know there are no pending writes if the dirty bit is not set, and we can just replace the line. If the dirty bit is set, then we need to write the line back to memory before replacing it; we may need to take an extra delay in the load instruction, or allow the CPU to continue, since the load instruction can place the requested value in a register and update the cache in parallel (because this is done by two different circuits). If the cache line can be written before the next instruction after the load, then no further action is needed, since the cache will be correct on the cycle following the load. Any subsequent reads to an address in the cache line that was just removed need to verify that the write to memory has completed. In write-through, when replacing a cache line, we need to check that any write from that line has completed before replacing it, but after that we know memory is consistent with cache. Personal opinion - I shift back and forth between write-back and write-through. Today I prefer write-through because it helps multi-tasking and shared memory. Other days, I favor write-back because it minimizes memory access and requires less work from the cache hardware (more on that later).

2.2.5. Replacement strategy (Requirement 4). We need to maximize the advantage we get from locality of reference. Being unable to see the future, we load blocks of memory beyond the address we are directly referencing, in the hope that some of the extra data in a cache line is stuff we will need soon. The problem we are left with is: the cache is much smaller than memory, so at some point we need to get rid of cache lines in order to load new data. Which lines do we get rid of? Direct mapped cache does not give us a choice of what to replace, and we can't build a large (over ~100 entries!) fully associative cache - so we will consider set associative cache (our methods may be extended to larger sets, up to fully associative). Consider a 4-way set-associative cache, as in our previous example. If the set is full, and we are loading a new cache line with the same index into the cache, we need to replace one of the lines already in the set. The one we want to replace is the one that we are not going to need, but we don't know which one that is. We can use the concept of locality of reference to guess. Temporal locality of reference says that if we have just used an address, we are likely to use the same or nearby addresses soon. This suggests that we keep cache lines in order of how recently we have used them - the line we want to replace is the least recently used line; this is the LRU algorithm. In a set of 4, we need 2 reference bits in each line to track the order in which we have referenced them; we use binary 00, 01, 10, 11 to denote the order of use, with 00 being the most recently used and 11 the least.

So we would test all four reference values in parallel: whichever line has reference bits = 11 is the one we replace; it becomes 00, and the other lines are incremented by 1 to restore the order. An issue - what happens when we access one of the lines? Suppose we access the line with reference bits 10 - then it becomes 00 (the most recently used), lines with lower numbers than 10 get incremented, but the line at 11 is unchanged - it is still the last line. So replacing the line with reference = 11 is straightforward, but added logic is required for updates to cache lines, because some reference bits need to be incremented and some don't. The logic is trivial for 2-way set associative, with one reference bit: whichever line just got updated or loaded is 0, the other is 1, and we always replace the line with reference = 1. Larger associativity increases the complexity of LRU. There are compromises that use fewer bits - for example, a 16-way set associative cache would need 4 reference bits to track the full order of access; we might instead use 2 bits to divide the lines into 4 subsets which have been more or less recently updated, or 1 bit to mark the half of the lines less recently updated than the other half. Another complication comes up if we use write-back logic - do we want to preferentially replace lines that do not have the dirty bit set, to minimize writes? A simpler method than LRU is to just randomly select which line to remove. This does not require any reference bits. It is not as bad as your first impression - in a 4-way set associative cache, there is only a 1/4 probability that you will replace the most recently used line, which is the one we are likeliest to need again. In any case, locality of reference only gives us an idea of what we will probably use next; it is not a law. In fact, we can imagine situations in which LRU is the worst possible thing to do - consider a loop that goes over multiple arrays, as in matrix multiplication. Our accesses would repeat over a wide range of addresses, but we use the range repeatedly - it is quite possible that addresses from one end of our range will be the least recently used, and will be removed from cache just before the loop jumps back to the start and needs them again. Random replacement is not sensitive to patterns of memory access the way LRU might be. Random replacement works better for larger sets: replacement of whichever line you have just used (the worst thing to do from a locality of reference standpoint) happens 1/4 of the time in 4-way sets, and 1/16 of the time in 16-way sets.
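A C sketch of the LRU bookkeeping for one 4-way set, using the 2-bit reference values described above (names are illustrative; the dirty field only matters if write-back is used, since a dirty victim must be written to memory before it is replaced):

    #include <stdio.h>
    #include <stdint.h>

    #define WAYS 4

    struct line {
        uint64_t tag;
        int valid;
        int dirty;   /* set on a write under write-back; a dirty victim must be written back first */
        int lru;     /* 0 = most recently used ... WAYS-1 = least recently used */
    };

    /* Pick the victim in one set: an invalid line if there is one, else the least recently used. */
    static int choose_victim(struct line set[WAYS]) {
        for (int w = 0; w < WAYS; w++)
            if (!set[w].valid || set[w].lru == WAYS - 1)
                return w;
        return 0;   /* not reached if the lru fields always hold a permutation of 0..WAYS-1 */
    }

    /* Update the reference values when way 'used' is accessed or refilled: it becomes 0,
       every line that was more recent than it moves down one step, and lines that were
       already older keep their values. */
    static void touch(struct line set[WAYS], int used) {
        for (int w = 0; w < WAYS; w++)
            if (w != used && set[w].lru < set[used].lru)
                set[w].lru++;
        set[used].lru = 0;
    }

    int main(void) {
        struct line set[WAYS] = {
            { .valid = 1, .lru = 2 }, { .valid = 1, .lru = 0 },
            { .valid = 1, .lru = 3 }, { .valid = 1, .lru = 1 },
        };
        touch(set, 0);                                    /* access way 0: it becomes most recent */
        printf("victim is way %d\n", choose_victim(set)); /* way 2 is still least recently used */
        return 0;
    }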

3. Effect of cache on performance. Here (again) is an example CPI computation from the Lecture 1 notes. Recall that this is based on measurements: because different programs have different fractions of usage for each instruction type, a table such as this would normally be generated from inspection of code and actual runs of a set of benchmarks that reflect the typical mix of programs this machine is expected to run.

instruction type    cycles   fraction   CPI
integer             1        .4         .4
logical             1        .2         .2
floating point      3        .1         .3
jump                2        .2         .4
memory              3        .1         .3
total                        1.0        1.6

As we saw before, performance on a given program can be calculated as

P = n * CPI / f; this is the expected CPU time required for a run of n instructions. Consider a machine where f is 3 GHz, and where a 64-byte cache line takes 18 nanoseconds to read - this is the cost of a cache miss, when the address we want is not in cache and we need to load a line from memory. In our original table, we arbitrarily assumed a 3-cycle average cost of memory access. Now let us take cache into account. We note that, as described above, writes typically do not incur a delay beyond one cycle - even if we need to write to memory, it will be handled by cache hardware without affecting computation. Our memory entry in the table splits into 2 rows, one for reads and another for writes. The writes cost one cycle; in calculating the reads, we need to consider cache performance. Let us modify the above table by assuming a fraction of .02 for writes and .08 for reads (typically there are more reads than writes). The cycles for writes are just a fixed number, 1; the cycles required for reads are an average value which we need to calculate. We previously calculated the cost of a miss as 18 nanoseconds, assuming 10 ns latency and reading 64 bytes at a memory bandwidth of 8 bytes/nanosecond. Assume our CPU has a 3 GHz clock (typical mid-level desktop machine) - so 1 cycle is 1/3 nanosecond, and 18 nanoseconds is 54 CPU cycles, our cost of a miss. Assume that our hit rate in L1 cache is 90%, with 1-cycle access time. Our average access time is cost-hit * (fraction-hit + fraction-miss * cost-miss). Filling in the numbers, we have:

Average read cost = 1*(.9 + .1*54) = 6.3 cycles with L1 cache only

so our table becomes:

instruction type    cycles   fraction   CPI
integer             1        .4         .4
logical             1        .2         .2
floating point      3        .1         .3
jump                2        .2         .4
read                6.3      .08        .50
write               1        .02        .02
total                        1.0        1.82

Suppose we had an L2 cache - our formula describing the cache is modified to cost-hit * (fraction-L1-hit + fraction-L1-miss * cost-L2 * (fraction-L2-hit + fraction-L2-miss * cost-miss)). Essentially, our L1-miss fraction now multiplies cost-L2 * (fraction-L2-hit + fraction-L2-miss * cost-miss) - we rewrite cost-miss in the first equation so it looks like the original L1 equation, just with different numbers filled in. Assume the basic cost of L2 is 3 cycles, and the hit rate there is 80%. When we fill in the numbers, we get:

Average read cost = 1*(.9 + .1*3*(.8 + .2*54)) = 4.38 cycles with L1 and L2 cache

so our read CPI goes to .35 and the total CPI becomes 1.67. The program mix assumed here has very little memory access - only 10% of the instruction mix. With a higher percentage of reads, CPI would be worse - with just L1 cache, reads are the costliest operation. Notice that with L1 and L2, jumps have more effect on CPI than reads with the current instruction mix - Amdahl's law is applicable: cache only improves read performance, everything else stays the same. In this example, even a fairly large improvement in read performance - 6.3 cycles to 4.38 - has a small effect on total performance, since reads account for a little less than 1/3 of CPI. Keep in mind, any change in performance can have unexpected effects - for example, if we increase the clock rate by 50% (CPU at 4.5 GHz), everything speeds up by that factor - but the CPI gets a lot worse, because with a 50% increase in the clock and reads still taking 18 nanoseconds, instead of this translating to 54 CPU cycles, it becomes 81 cycles of the faster clock. We still get a performance increase, but it will be less than the 50% improvement we expect from speeding up the clock.
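A small C sketch that just redoes the arithmetic above, so the numbers can be rechecked or the assumed hit rates, miss cost and instruction mix changed:

    #include <stdio.h>

    int main(void) {
        double miss_cost  = 54.0;   /* 18 ns at 3 GHz = 54 CPU cycles */
        double l1_hit     = 0.90, l1_cost = 1.0;
        double l2_hit     = 0.80, l2_cost = 3.0;

        /* Average read cost with L1 only and with L1 + L2, using the formulas in the text. */
        double read_l1    = l1_cost * (l1_hit + (1.0 - l1_hit) * miss_cost);
        double read_l1_l2 = l1_cost * (l1_hit + (1.0 - l1_hit) * l2_cost *
                                       (l2_hit + (1.0 - l2_hit) * miss_cost));

        /* CPI contributions from the rest of the mix:
           integer .4 + logical .2 + floating point .3 + jump .4 + write .02 */
        double other_cpi  = 0.4 + 0.2 + 0.3 + 0.4 + 0.02;
        double read_frac  = 0.08;

        printf("read cost, L1 only : %.2f cycles, total CPI %.2f\n",
               read_l1, other_cpi + read_frac * read_l1);
        printf("read cost, L1 + L2 : %.2f cycles, total CPI %.2f\n",
               read_l1_l2, other_cpi + read_frac * read_l1_l2);
        return 0;
    }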