Midterm I SOLUTIONS, March 30th, 2011: CS252 Graduate Computer Architecture


University of California, Berkeley
College of Engineering, Computer Science Division (EECS)
Spring 2011, John Kubiatowicz

Midterm I SOLUTIONS, March 30th, 2011
CS252 Graduate Computer Architecture

Your Name: ________________   SID Number: ________________

Problem   Possible   Score
1         20
2         30
3         30
4         20
Total     100

[ This page left for π: 3.141592653589793238462643383279502884197169399375105820974944 ]

Question #1: Short Answer [20pts]

Problem 1a [3pts]: Explain the difference between implicit and explicit register renaming. Can an architecture with Tomasulo scheduling use explicit register renaming? Explain.

Implicit register renaming occurs as a consequence of scheduling, as in the original Tomasulo algorithm for the 360/91: programmer-visible registers are replaced by internal tags or values. Explicit register renaming replaces user-visible register names with internal, physical register names; it is often performed in a separate stage between the fetch stage and the decode/scheduling stages. Yes, an architecture with Tomasulo scheduling could use explicit register renaming at the entrance to the pipeline (to remove WAR/WAW hazards), then use reservation stations to schedule instructions (removing RAW hazards); the scheduling would work with physical register names instead of tags.

Problem 1b [3pts]: What are precise exceptions? Explain how precise exceptions are achieved in an architecture with explicit register renaming (such as the R10000 or 21264, both discussed in class).

Precise exceptions are interruptions in the flow of instructions at a single identifiable instruction such that (1) all instructions prior to that instruction have committed their state and (2) neither the identified instruction nor any following instruction has altered the state of the machine. Machines with explicit register renaming can restore the machine to a precise exception point by restoring the register rename mapping table (and free list) to the value it had at that point; such machines do not have to restore actual register values, since the correct values are still in the physical register file.

Problem 1c [3pts]: What is simultaneous multithreading? What is the basic argument for using simultaneous multithreading?

Simultaneous multithreading (SMT) is a multithreading technique that mixes instructions from multiple threads into a single superscalar pipeline, thereby achieving higher utilization of the functional units. Instructions from different threads can be issued to the pipeline in the same cycle (the "simultaneous" part of SMT). The basic justification is that high-issue-rate processors typically have many wasted issue slots.

Problem 1d [2pts]: What was one of the main insights of the RISC work? Hint: simply reducing instructions was a consequence, not an insight.

One of the most important lessons of the RISC work was that computer designers should consider the whole picture (including both software and hardware layers) rather than micro-optimizing any one layer (such as the hardware). RISC also placed a strong focus on measurement and benchmarks as objective mechanisms for evaluating performance.
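To make the mechanism behind 1a and 1b concrete, here is a minimal sketch (not part of the exam; the register counts, class name, and method names are illustrative assumptions) of an explicit rename table with a free list, including the checkpoint/restore step that yields precise exceptions without copying physical register values.

```python
# Minimal sketch of explicit register renaming with checkpoint/restore.
# Register counts and method names are illustrative, not from the exam.

class RenameTable:
    def __init__(self, num_arch=32, num_phys=64):
        # Architectural register i initially maps to physical register i.
        self.map = list(range(num_arch))
        # The remaining physical registers start on the free list.
        self.free = list(range(num_arch, num_phys))

    def rename(self, dst, srcs):
        """Rename one instruction: read source mappings, then allocate a
        fresh physical destination (this removes WAR/WAW hazards)."""
        phys_srcs = [self.map[s] for s in srcs]
        phys_dst = None
        if dst is not None:
            phys_dst = self.free.pop(0)   # real hardware stalls if empty
            self.map[dst] = phys_dst
        return phys_dst, phys_srcs

    def checkpoint(self):
        """Snapshot the map and free list (e.g., at a potential exception point)."""
        return (list(self.map), list(self.free))

    def restore(self, snap):
        """Precise-exception recovery: the physical registers still hold the
        correct values, so restoring the mapping alone is sufficient."""
        self.map, self.free = list(snap[0]), list(snap[1])


rt = RenameTable()
snap = rt.checkpoint()
rt.rename(dst=1, srcs=[2, 3])   # e.g., ADD r1, r2, r3
rt.restore(snap)                # roll back as if the ADD never renamed
```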
Problem 1e [3pts]: Yeh and Patt described two different types of branch predictors: those that exploit correlation between successive branches (global history) and those that exploit the past behavior of each branch in isolation (local history). Does it make sense to combine these ideas in some way? Why or why not? How would you combine them? Explain with a block diagram.

Yes, it makes sense to combine both GA and PA styles of branch prediction; we read at least two papers showing that different branches are best handled by different algorithms (because they have different behavior). The easiest way to combine them is a tournament branch predictor that uses one predictor (a chooser) to decide which type of prediction to use for each branch, and two other predictors to perform each type of prediction. [Block diagram: the branch address and history index a chooser, which selects between Predictor A and Predictor B.]

Problem 1f [3pts]: Why does it make sense to use trace scheduling with a VLIW? What is the structure of the resulting code? Is there any relationship between trace scheduling and branch prediction?

Since a VLIW is statically scheduled, every operation placed into a single instruction must actually be able to run in parallel (no stalling). Unfortunately, typical programs have very few instructions in each basic block from which to choose a set of operations to run in parallel (typical code has a branch every 5 instructions). Trace scheduling picks a high-probability trace through the code and schedules all of the instructions along the trace as if they were in the same block (i.e. no branches). The resulting code has many instructions running in parallel, with many potential exit points (where the original branches might disagree with the selected trace), and blocks of instructions are followed by "fixup code" that undoes the results of instructions executed incorrectly along the trace. Yes, there is a relationship between trace scheduling and branch prediction: the trace represents a group of co-predicted branches (they are all predicted to follow the trace).

Problem 1g [3pts]: Why is it advantageous to speculate on memory dependencies between load and store instructions? Describe two techniques for performing such speculation.

Most of the time, loads do not depend on stores that are in the pipeline at the same time. Thus, a conservative strategy that never launches loads until all dependencies are resolved (i.e. all earlier store addresses are known) loses performance, especially since we would like to start loads early to cover cache misses. Assuming one has a mechanism for recovering from mis-speculation (e.g. a ROB), two possibilities are: (1) guess that loads with unknown dependencies never depend on prior stores ("naive speculation"), or (2) learn the dependencies between loads and stores by starting with naive speculation and keeping extra information when such speculation fails (using extra bits in the cache or "store sets" to track which stores a given load depends on).
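A rough sketch of the learning approach in 1g follows. This is a simplification of the store-set idea, not the exact hardware mechanism from the exam or the original paper; the class, method, and PC values are illustrative.

```python
# Simplified sketch of learning load-store dependences (store-set flavor).
# Names and structure are illustrative, not an exact hardware description.

from collections import defaultdict

class DependencePredictor:
    def __init__(self):
        # load PC -> set of store PCs this load has been caught depending on
        self.store_set = defaultdict(set)

    def may_issue(self, load_pc, inflight_store_pcs):
        """Naive speculation, refined by past violations: issue the load
        unless an in-flight store is in this load's learned store set."""
        return self.store_set[load_pc].isdisjoint(inflight_store_pcs)

    def record_violation(self, load_pc, store_pc):
        """Called when the pipeline detects that a speculatively issued load
        actually depended on an older store; recovery (e.g. via the ROB)
        squashes the load and everything after it."""
        self.store_set[load_pc].add(store_pc)


pred = DependencePredictor()
print(pred.may_issue(0x400, {0x3F0}))   # True: no known conflict, speculate
pred.record_violation(0x400, 0x3F0)     # the load at 0x400 violated against 0x3F0
print(pred.may_issue(0x400, {0x3F0}))   # False: the load now waits for that store
```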
Problem #2: Out-of-Order Superscalar Processor [30pts]

[Figure: block diagram of the processor. Fetch stages F1/F2 (assisted by a Branch Target Buffer), decode stage D, explicit register-renaming stage R, Reorder Buffer / ready-instruction selection stage ROB, and operand-fetch stage Op feed three execution pipelines: Pipeline 1 (EX1-EX3 plus FP1-FP5), Pipeline 2 (EX1-EX3), and Pipeline 3 (Addr, MEM1, MEM2, TagC), followed by register writeback.]

In this problem, we will explore a three-way superscalar processor. As shown in the above figure, this processor has three execution pipelines, each of which is slightly different. All three pipelines can execute integer operations in a single cycle (such as arithmetic ops or branch conditions) and can execute Galois Field operations in 3 cycles. In addition, Pipeline 1 can execute floating point operations: addf (2 cycles), multf (3 cycles), divf (5 cycles). Further, Pipeline 3 can execute memory operations in 4 cycles: the first cycle computes the address, the next two cycles perform the actual data cache operation, and the fourth cycle performs a "Tag Check" to match the current memory operation against the data cache. All operations are fully pipelined; thus, for instance, multiple load operations can be in different stages of Pipe 3 at the same time.

The earlier part of the pipeline, namely instruction management, consists of 6 stages:
- Stages F1 and F2 handle fetching up to three instructions at a time from the instruction cache and are completely pipelined (the instruction cache takes two cycles to fetch instructions). A Branch Target Buffer (BTB) helps to compute the next address.
- Stage D begins decoding the instructions and computing target addresses.
- Stage R explicitly renames registers for three instructions at a time. This stage also performs an initial instruction decode and checks the instruction cache tags.
- Stage ROB is a Reorder Buffer and Ready Instruction Selection stage. The ROB stage can select up to three instructions to pass on to the Op stage. It must make sure that the instructions it selects are ready to go (i.e. will have all arguments by the time they exit the Op stage) and that they can all execute simultaneously (e.g. it cannot select two floating point instructions in the same cycle).
- Stage Op fetches operands for three instructions from the register file and/or other stages of the processor. It further decodes the instructions.

Problem 2a [2pts]: What is the minimum number of instructions that we must flush on a branch misprediction? Explain, assuming that branches are completed in the EX1 stage.

If we assume that the branch is the latest instruction in its group of three fetched instructions (i.e. no instructions were fetched after the branch during the cycle in which we fetched the branch), then the answer is 18: if the branch makes it through execution in the fastest possible number of cycles, it is resolved 7 cycles after its fetch begins (F1, F2, D, R, ROB, Op, EX1). During the 6 cycles after the branch's own fetch cycle, 6 × 3 = 18 instructions are fetched behind the branch, and all of them must be flushed; see the timing sketch below.
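The arithmetic behind the 18-instruction answer can be laid out as a quick sketch. The stage names and fetch width follow the problem statement; the cycle bookkeeping itself is just illustrative.

```python
# Hedged sketch of the Problem 2a flush count: stage list and fetch width
# come from the problem statement; the calculation is illustrative.

front_end = ["F1", "F2", "D", "R", "ROB", "Op", "EX1"]  # branch resolves in EX1
fetch_width = 3

# Cycle 1 fetches the branch's own group (branch assumed last in that group).
# Every later cycle up to and including the resolving cycle fetches another
# group of wrong-path instructions.
resolve_cycle = len(front_end)               # 7
wrong_path_cycles = resolve_cycle - 1        # 6
flushed = wrong_path_cycles * fetch_width    # 18
print(f"branch resolves in cycle {resolve_cycle}; flush {flushed} instructions")
```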