
Computer Architecture Memory

Dr. Falah Hassan Ali, Part-time Professor, uOttawa, 2012

Cache memory, also called CPU memory, is random access memory (RAM) that a computer processor can access more quickly than it can access regular RAM. Cache memory is typically integrated directly into the CPU chip or placed on a separate chip that has a separate interconnect with the CPU.

A CPU cache is a cache used by the central processing unit (CPU) of a computer to reduce the average time to access data from the main memory. The cache is a smaller, faster memory which stores copies of the data from frequently used main memory locations. Most CPUs have different independent caches, including instruction and data caches, where the data cache is usually organized as a hierarchy of cache levels (L1, L2, etc.).

Cache memory is small in size but faster than RAM. Cache is faster because it is built from SRAM, whose flip-flops switch faster than the charge-storing capacitors of DRAM. Caches come in three levels: L1 (level 1), L2 (level 2), and L3 (level 3).

L1 cache is built into the processor. The processor accesses RAM once and transfers the block into its cache, so it does not need to access RAM again and again while the block is held in the cache. L2 cache can be built into the processor or sit outside it; it is faster than RAM but slower than L1 cache. L3 cache sits outside the processor core; it is faster than RAM but slower than L1 and L2 cache.

The amount of cache memory in a computer is far smaller than the RAM or hard drive capacity, both because of its high cost and because more cache memory means more heat on the processor. When an instruction is to be executed, the processor first searches the cache and, only if the instruction is not found there, searches RAM; this type of cache is called a look-up cache. Another type is the look-aside cache, which searches the cache and RAM simultaneously. A look-aside cache saves more time in searching.

Today's processors typically carry 6, 8, or 12 MB of cache. In mobile devices the cache memory concept is the same, but an app's cache data is something different: apps keep cached data so content can be accessed faster in the future.

Cache Line

Cache is partitioned into lines (also called blocks). Each line holds 4-64 bytes. During data transfer, a whole line is read or written.

Each line has a tag that indicates the address in main memory (M) from which the line has been copied.

Cache                    Main Memory
Index  Tag  Data         Index  Data
0      2    ABC          0      DEF
1      0    DEF          1      PQR
                         2      ABC
                         3      XYZ

Cache hit is detected through an associative search of all the tags. Associative search provides a fast response to the query:

“Does this key match with any of the tags”?

Data is read only if a match is found.
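As an illustration of the associative search described above, here is a minimal Python sketch; the function name and the list-of-tuples cache model are illustrative, not from the notes. Real hardware compares all tags in parallel, while this loop compares them one at a time.

def associative_lookup(cache_lines, key_tag):
    for tag, data in cache_lines:
        if tag == key_tag:
            return data        # cache hit: data is read only when a tag matches the key
    return None                # cache miss: fall back to main memory

cache = [(2, "ABC"), (0, "DEF")]        # mirrors the small cache table above
print(associative_lookup(cache, 0))     # hit  -> DEF
print(associative_lookup(cache, 3))     # miss -> None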

Types of Cache

1. Fully Associative
2. Direct Mapped
3. Set Associative

Fully Associative Cache

[Figure: fully associative cache — the memory address (M-addr) is used as a key and compared against every tag in the cache C; each line holds a tag and data copied from main memory M.]

1- No restriction on mapping from M to C.
2- Associative search of tags is expensive.
3- Feasible for very small size caches only.

Direct-Mapped Cache

A given memory block can be mapped into one and only one cache line. Here is an example of the mapping:

Cache line    Main memory blocks
0             0, 8, 16, 24, …, 8n
1             1, 9, 17, 25, …, 8n+1
2             2, 10, 18, 26, …, 8n+2
3             3, 11, 19, 27, …, 8n+3
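A minimal Python sketch of this mapping rule, assuming a cache with 8 lines (consistent with the stride of 8 in the table above):

NUM_LINES = 8    # assumed cache size, matching the stride of 8 in the table

def cache_line_for(block_number):
    return block_number % NUM_LINES   # direct mapping: one and only one line per block

for block in (0, 8, 16, 24, 1, 9, 17, 25):
    print("block", block, "-> line", cache_line_for(block))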

Advantage: no need for an expensive associative search!

Disadvantage: the miss rate may go up due to a possible increase in mapping conflicts.

Set-Associative Cache

[Figure: two-way set-associative cache — the cache C is divided into sets (set 0, set 1, …, set 3); each main memory block in M maps to one set but may occupy either line within it.]

N-way set-associative cache

Each M-block can now be mapped into any one of a set of N C-blocks. The sets are predefined. Let there be K blocks in the cache. Then:

N = 1: direct-mapped cache
N = K: fully associative cache

Most commercial caches have N = 2, 4, or 8. A set-associative cache is cheaper than a fully associative cache and has a lower miss ratio than a direct-mapped cache.

But direct-mapped cache is the fastest.
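A small Python sketch of the set-selection rule, under the usual assumption that set = block mod (K/N); the variable names and example values are illustrative:

def set_index(block_number, K, N):
    num_sets = K // N              # K blocks in the cache, N lines per set
    return block_number % num_sets

K = 8                              # example cache with 8 lines
for N in (1, 2, 8):                # N = 1 behaves direct-mapped, N = K fully associative
    print("N =", N, ": block 13 -> set", set_index(13, K, N), "of", K // N, "sets")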

Specification of a cache memory

Block size      4-64 bytes
Hit time        1-2 cycles
Miss penalty    8-32 cycles
  Access        6-10 cycles
  Transfer      2-22 cycles
Miss rate       1%-20%
Cache size      L1: 8 KB-64 KB    L2: 128 KB-2 MB
Cache speed     L1: 0.5 ns (8 GB/sec)    L2: 0.75 ns (6 GB/sec) on-chip cache

What happens to the cache during a write operation?

Write Policies

If data is written to the cache, at some point it must also be written to main memory; the timing of this write is known as the write policy. In a write-through cache, every write to the cache causes a write to main memory. Alternatively, in a write-back or copy-back cache, writes are not immediately mirrored to the main memory, and the cache instead tracks which locations have been written over, marking them as dirty. The data in these locations is written back to the main memory only when that data is evicted from the cache. For this reason, a read miss in a write-back cache may sometimes require two memory accesses to service: one to first write the dirty location to main memory, and then another to read the new location from memory. Also, a write to a main memory location that is not yet mapped in a write-back cache may evict an already dirty location, thereby freeing that cache space for the new memory location.
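The following Python sketch contrasts the two write policies by counting main memory writes; the class and method names are illustrative, and the dictionary-based cache is a simplification rather than a real cache model.

class WriteThroughCache:
    def __init__(self):
        self.lines = {}
        self.mem_writes = 0
    def write(self, addr, value):
        self.lines[addr] = value           # update the cache...
        self.mem_writes += 1               # ...and main memory, on every write

class WriteBackCache:
    def __init__(self):
        self.lines = {}                    # addr -> (value, dirty)
        self.mem_writes = 0
    def write(self, addr, value):
        self.lines[addr] = (value, True)   # mark the line dirty, defer the memory write
    def evict(self, addr):
        value, dirty = self.lines.pop(addr)
        if dirty:
            self.mem_writes += 1           # dirty data reaches memory only on eviction

wt, wb = WriteThroughCache(), WriteBackCache()
for i in range(10):                        # ten writes to the same location
    wt.write(0x40, i)
    wb.write(0x40, i)
wb.evict(0x40)
print(wt.mem_writes, wb.mem_writes)        # 10 memory writes vs 1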

Problem with Direct-Mapped

- Direct-mapped cache: two blocks in memory that map to the same index in the cache cannot be present in the cache at the same time.

- This can lead to a 0% hit rate if more than one block, accessed in an interleaved manner, maps to the same index.
  - Assume addresses A and B have the same index bits but different tag bits.
  - The access pattern A, B, A, B, A, B, A, B, … conflicts in the cache index.
  - All accesses are conflict misses.
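The sketch below reproduces this conflict pattern in Python for an assumed direct-mapped cache of 64 lines of 16 bytes; addresses A and B are chosen so that they share an index but differ in tag.

NUM_LINES = 64
LINE_SIZE = 16

def split(addr):
    index = (addr // LINE_SIZE) % NUM_LINES
    tag = addr // (LINE_SIZE * NUM_LINES)
    return index, tag

cache = {}                            # index -> tag currently stored
A, B = 0x0000, 0x0400                 # same index, different tags (0x400 = 64 lines * 16 bytes)
hits = misses = 0
for addr in [A, B] * 4:               # A, B, A, B, ...
    index, tag = split(addr)
    if cache.get(index) == tag:
        hits += 1
    else:
        misses += 1
        cache[index] = tag            # the other block is evicted
print("hits =", hits, "misses =", misses)   # hits = 0, misses = 8 -> 0% hit rate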

Set Associativity

Associative memory within the set:
-- More complex, slower access, larger tag store
+ Accommodates conflicts better (fewer conflict misses)

Modern processors have multiple interacting caches on chip. The operation of a particular cache can be completely specified by:
- the cache size
- the cache block size
- the number of blocks in a set
- the cache set replacement policy
- the cache write policy (write-through or write-back)

While all of the cache blocks in a particular cache are the same size and have the same associativity, typically "lower-level" caches (such as the L1 cache) have a smaller size, smaller blocks, and fewer blocks in a set, while "higher-level" caches (such as the L3 cache) have a larger size, larger blocks, and more blocks in a set.

Question: In a certain system the main memory access time is 100 ns. The cache is 10 times faster than the main memory and uses the write-through protocol. If the hit ratio for read requests is 0.92 and 85% of the memory requests generated by the CPU are reads, the remainder being writes, find the average access time considering both read and write requests.

Memory access time = 100 ns, so cache access time = 10 ns (10 times faster). To find the average time we use the formula

Tavg = h*c + (1 - h)*M

where h = hit rate, (1 - h) = miss rate, c = time to access information from the cache, and M = miss penalty (time to access main memory).

Write-through operation: the cache location and the main memory location are updated simultaneously. It is given that 85% of the requests generated by the CPU are reads and 15% are writes.

Tavg = 0.85 × (average time for a read request) + 0.15 × (average time for a write request)
     = 0.85 × (0.92 × 10 + 0.08 × 100) + 0.15 × (average time for a write request)

Here 0.92 is the hit ratio for read requests, but the hit ratio for write requests is not given. If we assume the write hit ratio equals the read hit ratio, then

Tavg = 0.85 × (0.92 × 10 + 0.08 × 100) + 0.15 × (0.92 × (10 + 100) + 0.08 × 100) = 31 ns

If we instead assume a 0% hit ratio for write requests, then

Tavg = 0.85 × (0.92 × 10 + 0.08 × 100) + 0.15 × (0 × 110 + 1 × 100) = 29.62 ns

Average access time considering only reads = 0.92 × 10 + 0.08 × 100 = 17.2 ns.
Average access time considering only writes = 100 ns (because with write-through you must go to memory whether it is a hit or a miss; even assuming a 0.5 hit ratio, 0.5 × 100 + 0.5 × 100 = 100 ns).

So the total access time for both reads and writes is 0.85 × 17.2 + 0.15 × 100 = 14.62 + 15 = 29.62 ns.

Note: you cannot assume the write hit ratio is the same as the read hit ratio. For a write request with write-through, whatever the case, you have to write to memory, so the write access time equals the memory access time.
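The same arithmetic can be checked with a few lines of Python; the variable names are illustrative, and the write path is modeled under the stated assumption that a write-through write always costs a full memory access.

t_mem, t_cache = 100, 10                           # main memory and cache access times in ns
h_read, read_frac = 0.92, 0.85                     # read hit ratio, fraction of reads

t_read = h_read * t_cache + (1 - h_read) * t_mem   # 0.92*10 + 0.08*100 = 17.2 ns
t_write = t_mem                                    # write-through: every write reaches memory
t_avg = read_frac * t_read + (1 - read_frac) * t_write
print(t_read, t_avg)                               # 17.2 ns and 29.62 ns (up to float rounding)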

Question 1. What advantage does a Harvard cache have over a unified cache?

A Harvard (split) cache permits the processor to access both an instruction word and a data word in a single cache memory cycle. This can significantly increase the performance of the processor. The only real drawback of a Harvard cache is that the processor needs two memory busses (two complete sets of address and data lines), one to each of the caches. However, when the Harvard cache is implemented within the processor chip, the separate memory busses never have to leave the chip, being implemented completely in the metal interconnections on the chip.

Question 2. Why would you want the system I/O to go directly to main memory rather than through the processor cache?

If the system I/O went through the processor cache, it would cause contention with the processor trying to use the cache. The processor-to-cache bus is almost 100% utilized in most implementations, and any time the system I/O tried to use it the processor would have to wait. On the other hand, the main memory bus in a well-designed uniprocessor system should have a much lower utilization. With a cache hit rate of 97% and a 10-to-1 ratio of main memory access time to cache access time, the main memory bus is only about 30% utilized by the processor.
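One way to read the 30% figure, under the rough assumption (mine, not from the notes) that every miss occupies the memory bus for ten cache-time slots while each access occupies the processor-cache bus for one:

miss_rate = 0.03                        # 97% hit rate
mem_to_cache_ratio = 10                 # main memory access takes 10x a cache access
print(miss_rate * mem_to_cache_ratio)   # 0.3 -> roughly 30% memory-bus utilization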

Question 3. If an 8-way set-associative cache is made up of 32-bit words, 4 words per line, and 4096 sets, how big is the cache in bytes?

We convert words per line to bytes per line: 4 bytes/word × 4 words/line = 16 bytes/line. The cache size is L × K × N = 16 × 8 × 4096 = 512K bytes, where L is the line size in bytes, K the associativity, and N the number of sets.
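The same size arithmetic in Python, assuming 32-bit (4-byte) words as in the answer above:

bytes_per_word, words_per_line = 4, 4          # 32-bit words, 4 words per line
ways, sets = 8, 4096
line_bytes = bytes_per_word * words_per_line   # 16 bytes per line
cache_bytes = line_bytes * ways * sets         # L x K x N
print(cache_bytes, cache_bytes // 1024)        # 524288 bytes = 512 Kbytes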

Question 4. If a memory system consists of a single external cache with an access time of 20 ns and a hit rate of 0.92, and a main memory with an access time of 60 ns, what is the effective memory access time of this system?

t(eff) = 20 + (0.08)(60) = 24.8 ns

Question 5. We now add virtual memory to the system described in Question 4. The TLB is implemented internal to the processor chip and takes 2 ns to do a translation on a TLB hit. The TLB hit ratio is 98%, the segment table hit ratio is 100%, and the page table hit ratio is 50%. What is the effective memory access time of the system with virtual memory?

teff = tTLB + (1 – hTLB)(tSEG + tPAGE) + tCACHE + (1 - hCACHE)tMAIN

teff = 2 + 0.02(20 + 20 + 0.5(60)) + 20 + (0.08)(60) = 28.2 ns

This represents a drop in performance of (28.2 – 24.8)/24.8 ≈ 14%.
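The arithmetic of Questions 4 and 5 can be checked with the sketch below; it assumes, as in the answer above, that the 50% figure is the page-table hit ratio in the cache.

t_cache, t_main, h_cache = 20, 60, 0.92
t_eff_q4 = t_cache + (1 - h_cache) * t_main            # Question 4: 24.8 ns

t_tlb, h_tlb = 2, 0.98
t_seg = t_cache                                        # segment table: 100% cache hit
t_page = t_cache + 0.5 * t_main                        # page table: 50% cache hit (assumed)
t_eff_q5 = t_tlb + (1 - h_tlb) * (t_seg + t_page) + t_cache + (1 - h_cache) * t_main
print(t_eff_q4, t_eff_q5, (t_eff_q5 - t_eff_q4) / t_eff_q4)   # 24.8, 28.2, about 0.14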

Question 6. LRU is almost universally used as the cache replacement policy. Why?

It is easy to implement and has been proven to be the best predictor of future use behind the OPT algorithm. However, OPT is not achievable, since it requires perfect knowledge of future use.

Question 7. What is the advantage of using a write-back cache instead of a write-through cache?

A write-back cache significantly reduces the utilization of (and thus contention on) the main memory bus, because programs normally write several times to the same cache line (and even the same word). With a write-through cache, we would have to use main memory cycles to write the data to main memory repeatedly. With a write-back cache, we only need to write the line when we displace it because of some other cache miss, which happens, on average, at the miss rate of the cache. For example, if the miss rate of the cache is 0.03, we would have to write back the line only 3 out of 100 times, significantly reducing contention on the main memory bus. This all assumes that we have a deep enough FIFO that the processor cache doesn't have to wait for the writes to complete.

Question 8. What is a disadvantage of using a write-allocate policy with a write-through cache?

The write-allocate policy means that I allocate a line in the cache on a write miss. This is opposed to the write no-allocate policy where I just write around the cache directly to main memory on a write miss. The disadvantage of allocating a line in the cache for a write miss is that I need to fill the rest of the cache line from main memory. This means that I need to read from main memory to fill the line as well as update the one word that was written by the processor in both the cache and main memory. When I am using a write-back cache, I do want to go ahead and allocate the cache slot in anticipation of having several more write hits where I don't go to main memory at all.

Question 9. What is "bus-snooping" used for?

Bus snooping is where the processor cache monitors the memory bus for other devices addressing memory locations that are being held in the cache. When it detects some other device (probably I/O) writing into a main memory location that is being held in the cache, it invalidates the corresponding cache line so that main memory and the cache are kept in sync.

Question 10. You find that it would be very inexpensive to implement a small, direct-mapped cache of 32K bytes with an access time of 30 ns. However, the hit rate would be only about 50%. If the main memory access time is 60 ns, does it make sense to implement the cache?

teff with no cache is 60 ns. teff with the cache is 30 + (0.5)(60) = 60 ns. It didn't get any better with the cache, so we shouldn't implement it. Actually, we would need to use memory interleaving to achieve the 60 ns without the cache, because the cycle time of DRAM is about twice as long as the access time.

The following questions are about the chapter "Cache Memory":

1- A set-associative cache consists of 64 lines, or slots, divided into four-line sets. Main memory contains 4K blocks of 128 words each. Show the format of main memory addresses.

Answer: The cache is divided into 16 sets of 4 lines each. Therefore, 4 bits are needed to identify the set number. Main memory consists of 4K = 2^12 blocks. Therefore, the set plus tag lengths must be 12 bits, and therefore the tag length is 8 bits. Each block contains 128 words. Therefore, 7 bits are needed to specify the word.

Main memory address:

TAG  SET  WORD
 8    4    7
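A short Python sketch of the field-width arithmetic for this problem (log2 of each count gives the number of bits; the variable names are illustrative):

import math

sets = 64 // 4                                  # 64 lines in four-line sets -> 16 sets
blocks = 4 * 1024                               # 4K blocks of main memory
words_per_block = 128

set_bits = int(math.log2(sets))                 # 4
word_bits = int(math.log2(words_per_block))     # 7
tag_bits = int(math.log2(blocks)) - set_bits    # 12 - 4 = 8
print(tag_bits, set_bits, word_bits)            # 8 4 7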

2- A two-way set-associative cache has lines of 16 bytes and a total size of 8 Kbytes. The 64-Mbyte main memory is byte addressable. Show the format of main memory addresses.

Answer: There are a total of 8 Kbytes/16 bytes = 512 lines in the cache. Thus the cache consists of 256 sets of 2 lines each. Therefore 8 bits are needed to identify the set number. For the 64-Mbyte main memory, a 26-bit address is needed. Main memory consists of 64 Mbytes/16 bytes = 2^22 blocks. Therefore, the set plus tag lengths must be 22 bits, so the tag length is 14 bits and the word field length is 4 bits.

Main memory address:

TAG  SET  WORD
 14   8    4

4- Consider a machine with a byte-addressable main memory of 2^16 bytes and a block size of 8 bytes. Assume that a direct-mapped cache consisting of 32 lines is used with this machine.
a. How is a 16-bit memory address divided into tag, line number, and byte number?
b. How many total bytes of memory can be stored in the cache?
c. Why is the tag also stored in the cache?

Answer: a. 8 leftmost bits = tag; 5 middle bits = line number; 3 rightmost bits = byte number.

TAG  LINE  WORD
 8    5     3

b. 256 bytes (32 lines × 8 bytes).

c. Because two items with two different memory addresses can be stored in the same place in the cache. The tag is used to distinguish between them.

5- Consider a memory system that uses a 32-bit address to address at the byte level, plus a cache that uses a 64-byte line size.
a. Assume a direct-mapped cache with a tag field in the address of 20 bits. Show the address format and determine the following parameters: number of addressable units, number of blocks in main memory, number of lines in cache, size of tag.
b. Assume an associative cache. Show the address format and determine the following parameters: number of addressable units, number of blocks in main memory, number of lines in cache, size of tag.
c. Assume a four-way set-associative cache with a tag field in the address of 9 bits. Show the address format and determine the following parameters: number of addressable units, number of blocks in main memory, number of lines in set, number of sets in cache, number of lines in cache, size of tag.

Answer:
a. Address format: Tag = 20 bits; Line = 6 bits; Word = 6 bits. Number of addressable units = 2^(s+w) = 2^32 bytes; number of blocks in main memory = 2^s = 2^26; number of lines in cache = 2^r = 2^6 = 64; size of tag = 20 bits.

b. Address format: Tag = 26 bits; Word = 6 bits. Number of addressable units = 2^(s+w) = 2^32 bytes; number of blocks in main memory = 2^s = 2^26; number of lines in cache = undetermined; size of tag = 26 bits.

c. Address format: Tag = 9 bits; Set = 17 bits; Word = 6 bits. Number of addressable units = 2^(s+w) = 2^32 bytes; number of blocks in main memory = 2^s = 2^26; number of lines in a set = k = 4; number of sets in cache = 2^d = 2^17; number of lines in cache = k × 2^d = 2^19; size of tag = 9 bits.

6- Consider a computer with the following characteristics: total of 1 Mbyte of main memory; word size of 1 byte; block size of 16 bytes; and cache size of 64 Kbytes.
a. For the main memory addresses of F0010, 01234, and CABBE, give the corresponding tag, cache line address, and word offsets for a direct-mapped cache.
b. Give any two main memory addresses with different tags that map to the same cache slot for a direct-mapped cache.
c. For the main memory addresses of F0010 and CABBE, give the corresponding tag and offset values for a fully associative cache.
d. For the main memory addresses of F0010 and CABBE, give the corresponding tag, cache set, and offset values for a two-way set-associative cache.

Answer:
a. Because the block size is 16 bytes and the word size is 1 byte, there are 16 words per block. We need 4 bits to indicate which word we want out of a block. Each cache line/slot matches a memory block, so each cache line contains 16 bytes. If the cache is 64 Kbytes, then 64 Kbytes/16 = 4096 cache lines. To address these 4096 cache lines, we need 12 bits (2^12 = 4096). Consequently, given a 20-bit (1 Mbyte) main memory address:
Bits 0-3 indicate the word offset (4 bits).
Bits 4-15 indicate the cache line/slot (12 bits).
Bits 16-19 indicate the tag (remaining 4 bits).

F0010 = 1111 0000 0000 0001 0000
Word offset = 0000 = 0
Line = 0000 0000 0001 = 001
Tag = 1111 = F

01234 = 0000 0001 0010 0011 0100
Word offset = 0100 = 4
Line = 0001 0010 0011 = 123
Tag = 0000 = 0

CABBE = 1100 1010 1011 1011 1110
Word offset = 1110 = E
Line = 1010 1011 1011 = ABB
Tag = 1100 = C

b. We need to pick any address where the line is the same but the tag (and optionally the word offset) is different. Here are two examples where the line is 1111 1111 1111:

Address 1: Word offset = 1111, Line = 1111 1111 1111, Tag = 0000 → Address = 0FFFF
Address 2: Word offset = 0001, Line = 1111 1111 1111, Tag = 0011 → Address = 3FFF1

c. With a fully associative cache, the address is split up into a TAG and a WORD OFFSET field. We no longer need to identify which line a memory block might map to, because a block can be in any line and we search every cache line in parallel. The word offset must be 4 bits to address each individual word in the 16-word block. This leaves 16 bits for the tag.

F0010: Word offset = 0h, Tag = F001h
CABBE: Word offset = Eh, Tag = CABBh

d. As computed in part a, we have 4096 cache lines. If we implement a two-way set-associative cache, we put two cache lines into one set. Our cache now holds 4096/2 = 2048 sets, where each set has two lines. To address these 2048 sets we need 11 bits (2^11 = 2048). Once we address a set, we simultaneously search both cache lines to see if one has a tag that matches the target. Our 20-bit address is now broken up as follows:
Bits 0-3 indicate the word offset.
Bits 4-14 indicate the cache set.
Bits 15-19 indicate the tag.

F0010 = 1111 0000 0000 0001 0000
Word offset = 0000 = 0
Cache set = 000 0000 0001 = 001
Tag = 11110 = 1 1110 = 1E

CABBE = 1100 1010 1011 1011 1110
Word offset = 1110 = E
Cache set = 010 1011 1011 = 2BB
Tag = 11001 = 1 1001 = 19
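A small Python sketch of the direct-mapped split used in part a (4 offset bits, 12 line bits, 4 tag bits); the function name is illustrative:

def split_direct_mapped(addr, offset_bits=4, line_bits=12):
    offset = addr & ((1 << offset_bits) - 1)
    line = (addr >> offset_bits) & ((1 << line_bits) - 1)
    tag = addr >> (offset_bits + line_bits)
    return tag, line, offset

for a in (0xF0010, 0x01234, 0xCABBE):
    tag, line, offset = split_direct_mapped(a)
    print("%05X: tag=%X line=%03X offset=%X" % (a, tag, line, offset))
    # F0010: tag=F line=001 offset=0
    # 01234: tag=0 line=123 offset=4
    # CABBE: tag=C line=ABB offset=E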