Name: ______CS/ECE472 Midterm #2 FALL 2012

NAME: ______Student ID#: ______

OPEN BOOK, OPEN NOTES. You can use material from previous exams, the book, etc. You CANNOT use other people for help.

SHOW YOUR WORK. ANSWERS ONLY will be treated as if cheating may have occurred, ESPECIALLY SINCE THIS IS A TAKEHOME EXAM.

OUT: Nov 26, 2012, 09:00AM-PST IN: Dec 02, 2012, 11:59PM-PST

Your signature is your promise that you have not cheated and will not cheat on this exam, nor will you help others to cheat on this exam. NOTE: If exam is not signed, it will not be scored.

Signature: ______

ABET Requirements: 1. Use various metrics to calculate the performance of a computer system 2. Identify the addressing mode of instructions 3. Determine which hardware blocks and control lines are used for specific instructions 4. Demonstrate how to add and multiply integers and floating-point numbers using twos complement and IEEE floating point representation 5. Analyze clock periods, performance, and instruction throughput of single-cycle, multi-cycle, and pipelined implementations of a simple instruction set 6. Detect pipeline hazards and identify possible solutions to those hazards 7. Show how cache design parameters affect cache hit rate 8. Map a virtual address into a physical address

Question 1 ______(20 points) (ABET #2, #3) Question 2 ______(20 points) (ABET #6) Question 3 ______(20 points) (ABET #5, #1) Question 4 ______(20 points) (ABET #8)

Question 5 ______(20 points)

TOTAL ______1. (15 points) Assume that you have a pipelined implementation of MIPS with a perfect cache. Draw the pipelining diagram for the code along the left side. I have started the diagram for you with the first statement. Indicate all data dependencies by circling the registers in the code. Indicate all necessary forwarding from the output of one functional unit to the input of another functional unit. Indicate all necessary stalls with a nop. Keep everything that happens in the same clock cycle in a straight column. You do not have to shade the active units. Assume BRANCH NOT TAKEN.

TWO MORE DIFFICULT THINGS ADDED to this problem: a) Assume the register file CANNOT do a WRITE and READ in the SAME cycle b) The branch decision has been moved from the EX stage into the ID stage

IF ID EX MA WB lw $1, 16($2) beq $1, $4, NEXT1 add $4, $1, $3 beq $4, $3, NEXT2 2. (10 points) The CPI of a machine with a perfect cache is 2.0 for a certain benchmark.

Now, assume a L1 instruction cache miss rate of 2% and a L1 data cache miss rate of 11%. Assume the miss penalty (for either L1 caches) is 10 cycles. The hit time is 2 cycles (L1 cache). Assume a L2 instruction cache miss rate of 0.5% and a L2 data cache miss rate of 3%. Assume the miss penalty (for either L2 caches) is 800 cycles. The hit time is 10 cycles (L2 cache).

The frequency of data access instructions in this benchmark is 11%.

How much faster is the machine with the perfect cache than the one with the imperfect L1 and L2 caches? Otherwise the machines are identical and running the same code. Express your answer either as a fraction or as a decimal rounded to the hundredths place. ( X.XX) Your answer should be > 1.

SHOW ALL WORK. 3a. (5 points) Assume we have a 64 KB cache, direct-mapped cache. (KB = 2^10 bytes) The block size is 4 words. Given a 32-bit physical address, divide up all the bits and indicate what they are used for to find and access the requested word in the cache. Do NOT draw the cache entries, the mux, the AND gate, ... Just indicate exactly how many bits and which bits of the address are used for each purpose. Label your groups of bits with their purpose. 31 0

3b. (5 points) Assume we have a 64 KB cache, 8-way set associative cache. (KB = 2^10 bytes) The block size is 16 words. Given a 32-bit physical address, divide up all the bits and indicate what they are used for to find and access the requested word in the cache. Do NOT draw the cache entries, the mux, the AND gate, ... Just indicate exactly how many bits and which bits of the address are used for each purpose. Label your groups of bits with their purpose. 31 0

3c. (10 points) DRAW the cache entries, the mux, the AND gate, the valid and dirty bits, LRU, for BOTH designs below (if they are required. You will be penalized if they are NOT required). Also include the decode logic symbolically, so that I know that you understand how to build an address decoder. 4. Assume a 48-bit virtual address and a 32-bit physical address. The page size is 16 KB. How many entries are there in the page table? Express your answer in powers of 2. Show your work for this problem. (KB = 2^10 bytes)

5. One-half page of notes, summarizing either:

1) http://videos.dac.com/46th/wedkey/dally.html 2) http://www.monolithic3d.com/2/post/2011/11/the-dally-nvidia-stanford-prescription-for- exascale-computing.html

Note that (2) is very similar to (1), except that it focuses more on parallel programming and less about CUDA. 0 M Add u x ALU 1 4 Add result Shift RegDst left 2 Branch MemRead Instruction [31–26] MemtoReg Control ALUOp MemWrite ALUSrc RegWrite

Instruction [25–21] Read Read register 1 PC address Read data 1 Instruction [20–16] Read Zero Instruction register 2 0 ALU [31–0] Read ALU Read M Write 0 Address 1 u data 2 result data M Instruction Instruction [15–11] x register M memory u u 1 x x Write 1 0 data Registers Data Write memory data 16 32 Instruction [15–0] Sign ALU extend control

Instruction [5–0] Page 8 PCWriteCond PCSource

PCWrite Outputs ALUOp IorD ALUSrcB MemRead Control ALUSrcA MemWrite Op RegWrite MemtoReg [5–0] IRWrite RegDst 0 Jump M address 1 u Shift x Instruction [25-0] 26 28 [31–0] left 2 2 Instruction [31–26] 0 PC [31–28] PC 0 M Instruction Read u Address [25–21] register 1 M x Read u A x 1 Instruction data 1 Memory Read 1 Zero [20–16] register 2 0 MemData ALU ALU Instruction M Registers ALUOut Write result [15–0] Instruction u Read x register 0 Write [15–11] data 2 B data Instruction 1 4 1 M register Write u 0 data 2 x Instruction M 3 u [15–0] x 1 Memory 16 32 ALU data Sign Shift register extend left 2 control

Instruction [5–0]

Page 9