
Lecture 18: VLIW and EPIC

Static superscalar, VLIW, and EPIC (first introducing fast and high-bandwidth L1 cache design)

Fast Hit by Virtual Cache

Physical cache - physically indexed and physically tagged cache
Virtual cache - virtually indexed and virtually tagged cache
- Must flush the cache at context switch, or add a process ID (PID) to the tag
- Must handle virtual-address aliases that map to an identical physical address

[Figure: a physical cache sends the V-addr through the TLB first, then uses the P-index to select a set and compares the P-tag; a virtual cache uses the V-index and compares the V-tag plus PID directly, with misses going to L2.]

Fast Cache Hits by Avoiding Translation: Process ID Impact

[Figure: miss rate (Y axis, up to 20%) vs. cache size (X axis, 2 KB to 1024 KB). Black is uniprocess; light gray is multiprocess when the cache is flushed; dark gray is multiprocess when a process ID tag is used. Higher associativity moves the barrier to the right.]

Virtually Indexed, Physically Tagged Cache

What is the motivation? A fast cache hit by parallel TLB access, without the virtual-cache shortcomings.
[Figure: the V-index field of the V-addr indexes the cache while the TLB translates in parallel; the resulting P-addr supplies the P-tag for the compare.]
How could it be correct?
- Require cache way size <= page size; the physical index then comes entirely from the page offset
- Virtual and physical indices are identical => the cache works like a physically indexed cache!
What if we want bigger caches?
- Higher associativity
- Page coloring
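A minimal sketch of the way-size constraint above, in C. It checks only the arithmetic (way size = cache size / associativity must fit within a page so every index bit comes from the page offset); function and parameter names are illustrative, not from any real design.

    #include <stdio.h>

    /* A cache can be virtually indexed, physically tagged without
     * index aliasing iff its way size does not exceed the page size. */
    static int vipt_ok(long cache_bytes, int ways, long page_bytes)
    {
        long way_size = cache_bytes / ways;  /* bytes indexed within one way */
        return way_size <= page_bytes;       /* all index bits in page offset? */
    }

    int main(void)
    {
        /* With 4KB pages: a 64KB 2-way cache has 32KB ways, so plain VIPT
         * indexing breaks; raising associativity to 16-way fixes it. */
        printf("64KB 2-way:  %s\n",
               vipt_ok(64 * 1024, 2, 4096) ? "ok" : "needs coloring or more ways");
        printf("64KB 16-way: %s\n",
               vipt_ok(64 * 1024, 16, 4096) ? "ok" : "needs coloring or more ways");
        return 0;
    }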

Pipelined Cache Access

For multi-issue processors, cache bandwidth affects the effective cache hit time
- Queueing delay adds up if the cache does not have enough read/write ports
Pipelined cache accesses: reduce cache cycle time and improve bandwidth
Cache organizations for high-bandwidth access:
- Duplicate cache
- Banked cache (bank mapping sketched below)
- Double-clocked cache

Pipelined Cache Access: Alpha 21264 Data Cache

- The cache is 64KB, 2-way associative; it cannot be accessed within one cycle
- One cycle is used for address transfer and data transfer, pipelined with the data array access
- The cache clock doubles the processor frequency; wave pipelining is used to achieve that speed
- Key technology: wave pipelining, which does not need latches
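A minimal sketch of the banked-cache organization from the first slide above, assuming line-interleaved banks; the constants and names are illustrative.

    #include <stdint.h>

    enum { LINE_BYTES = 64, N_BANKS = 4 };

    /* Interleave consecutive cache lines across banks so independent
     * accesses in the same cycle can proceed in parallel unless they
     * collide on the same bank. */
    static unsigned bank_of(uint64_t addr)
    {
        return (unsigned)((addr / LINE_BYTES) % N_BANKS);
    }

    /* Two accesses can be serviced together iff bank_of(a) != bank_of(b). */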


Trace Cache

Trace: a dynamic sequence of instructions, including taken branches
Traces are dynamically constructed by processor hardware, and frequently used traces are stored into the trace cache
Example: P4 processor, storing about 12K µops

(End of the cache review)

Two Paths to High ILP

Modern superscalar processors: dynamically scheduled, with branch prediction, dynamic memory disambiguation, and non-blocking caches => more and more hardware functionality AND complexity
Another direction: let the compiler take the complexity
- Simple hardware, smart compiler
- Static superscalar, VLIW, EPIC


MIPS Pipeline with Pipelined Multi-Cycle Operations

Page A-50, Figure A.31
[Figure: IF and ID feed four parallel execution paths: the single-cycle integer EX, a seven-stage FP multiplier (M1-M7), a four-stage FP adder (A1-A4), and an unpipelined divider (DIV), all converging at MEM and WB.]
Pipelined implementation, e.g.: 7 outstanding MULs, 4 outstanding ADDs, unpipelined DIV
In-order execution, out-of-order completion
More forwarding data paths

More Hazards Detection and Forwarding

Assume checking hazards at ID (the simple way)
Structural hazards
- Hazards at WB: track usage of registers at ID, stall instructions if a hazard is detected
- Separate int and FP registers to reduce hazards
RAW hazards: check source registers against all EX stages except the last ones
- A dependent instruction must wait for the producing instruction to reach the last stage of EX
- Ex: check against ID/A1, A1/A2, A2/A3, but not A4/MEM
WAW hazards
- Instructions reach WB out of order
- Check all multi-cycle stages (A1-A4, DIV, M1-M7) for the same dest register
Out-of-order completion complicates the maintenance of precise exceptions
Tomasulo w/o ROB: out-of-order execution and out-of-order completion; with a ROB: in-order commit
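A toy sketch of the ID-stage checks just described, assuming a scoreboard-like table of in-flight multi-cycle stages; all names are illustrative. Producers in the final EX stage are skipped for the RAW check because forwarding covers them, as on the slide.

    #include <stdbool.h>

    typedef struct {
        bool busy;       /* stage currently holds an instruction      */
        bool last_stage; /* e.g. A4 or M7: result can be forwarded    */
        int  dest;       /* destination register number, -1 if none   */
    } ExStage;

    /* Stall the instruction in ID if it would violate a RAW or WAW
     * hazard against any in-flight multi-cycle operation. */
    static bool must_stall(const ExStage *ex, int n,
                           int src1, int src2, int dest)
    {
        for (int i = 0; i < n; i++) {
            if (!ex[i].busy) continue;
            /* RAW: a source is still being produced upstream of the
             * last stage, so forwarding cannot help yet. */
            if (!ex[i].last_stage &&
                (ex[i].dest == src1 || ex[i].dest == src2))
                return true;
            /* WAW: same destination still in any multi-cycle stage. */
            if (ex[i].dest == dest)
                return true;
        }
        return false;
    }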

Compiler Optimization: Loop Unrolling

Example: add a scalar to a vector:

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

MIPS code (assume a 1-cycle stall after L.D, a 2-cycle stall after ADD.D, and a 1-cycle stall for the taken branch):

    Loop: L.D    F0,0(R1)    ;F0=vector element
          (stall)
          ADD.D  F4,F0,F2    ;add scalar from F2
          (stall)
          (stall)
          S.D    0(R1),F4    ;store result
          DSUBUI R1,R1,8     ;decrement pointer
          BNEZ   R1,Loop     ;branch if R1!=zero
          (stall)

Unrolled four times (three of the four DSUBUI/BNEZ pairs are dropped and the offsets are adjusted):

    1  Loop: L.D    F0,0(R1)
    2         ADD.D  F4,F0,F2
    3         S.D    0(R1),F4     ;drop DSUBUI & BNEZ
    4         L.D    F6,-8(R1)
    5         ADD.D  F8,F6,F2
    6         S.D    -8(R1),F8    ;drop DSUBUI & BNEZ
    7         L.D    F10,-16(R1)
    8         ADD.D  F12,F10,F2
    9         S.D    -16(R1),F12  ;drop DSUBUI & BNEZ
    10        L.D    F14,-24(R1)
    11        ADD.D  F16,F14,F2
    12        S.D    -24(R1),F16
    13        DSUBUI R1,R1,#32    ;alter to 4*8
    14        BNEZ   R1,LOOP
    15        NOP
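The same transformation at the C level: the rolled loop above, unrolled four times by hand. This sketch assumes, as the slide does, that the trip count is a multiple of 4.

    /* Add scalar s to x[1..1000], four elements per iteration. */
    void add_scalar(double *x, double s)
    {
        for (int i = 1000; i > 0; i -= 4) {
            x[i]     += s;
            x[i - 1] += s;
            x[i - 2] += s;
            x[i - 3] += s;
        }
    }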


Unrolled Loop That Minimizes Stalls

Original (each copy reuses F0 and F4):

    1  Loop: L.D    F0,0(R1)
    2         ADD.D  F4,F0,F2
    3         S.D    0(R1),F4
    4         L.D    F0,-8(R1)
    5         ADD.D  F4,F0,F2
    6         S.D    -8(R1),F4
    7         L.D    F0,-16(R1)
    8         ADD.D  F4,F0,F2
    9         S.D    -16(R1),F4
    10        L.D    F0,-24(R1)
    11        ADD.D  F4,F0,F2
    12        S.D    -24(R1),F4
    13        DSUBUI R1,R1,#32
    14        BNEZ   R1,LOOP
    15        NOP

After register renaming (each copy gets its own registers):

    1  Loop: L.D    F0,0(R1)
    2         ADD.D  F4,F0,F2
    3         S.D    0(R1),F4
    4         L.D    F6,-8(R1)
    5         ADD.D  F8,F6,F2
    6         S.D    -8(R1),F8
    7         L.D    F10,-16(R1)
    8         ADD.D  F12,F10,F2
    9         S.D    -16(R1),F12
    10        L.D    F14,-24(R1)
    11        ADD.D  F16,F14,F2
    12        S.D    -24(R1),F16
    13        DSUBUI R1,R1,#32
    14        BNEZ   R1,LOOP
    15        NOP

After code movement (moving loads before stores, and moving a store past DSUBUI into the delayed branch slot):

    1  Loop: L.D    F0,0(R1)
    2         L.D    F6,-8(R1)
    3         L.D    F10,-16(R1)
    4         L.D    F14,-24(R1)
    5         ADD.D  F4,F0,F2
    6         ADD.D  F8,F6,F2
    7         ADD.D  F12,F10,F2
    8         ADD.D  F16,F14,F2
    9         S.D    0(R1),F4
    10        S.D    -8(R1),F8
    11        S.D    -16(R1),F12
    12        DSUBUI R1,R1,#32
    13        BNEZ   R1,LOOP
    14        S.D    8(R1),F16    ;delayed branch slot; 8-32 = -24
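The renaming and code-movement steps can also be seen at the C level: giving each unrolled copy its own temporary (like F4/F8/F12/F16 above) removes the WAW/WAR dependences that register reuse creates, which is what lets all the loads be hoisted ahead of the stores. A minimal sketch:

    /* All four loads complete before any store, mirroring the
     * scheduled MIPS version; t0-t3 play the role of F4-F16. */
    void add_scalar_renamed(double *x, double s)
    {
        for (int i = 1000; i > 0; i -= 4) {
            double t0 = x[i]     + s;
            double t1 = x[i - 1] + s;
            double t2 = x[i - 2] + s;
            double t3 = x[i - 3] + s;
            x[i]     = t0;
            x[i - 1] = t1;
            x[i - 2] = t2;
            x[i - 3] = t3;
        }
    }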


VLIW: Very Long Instruction Word

Static superscalar: hardware detects hazards, compiler determines scheduling
VLIW: the compiler takes both jobs
Each "instruction" has explicit coding for multiple operations
There is no, or only partial, hardware hazard detection
- No dependence-check logic for instructions issued in the same cycle
- Wide instruction format allows theoretically high ILP
Tradeoff: instruction space for simple decoding
- The long instruction word has room for many operations
- But it has to be filled with NOPs if not enough operations are found

VLIW Example: Loop Unrolling

    Memory reference 1  Memory reference 2  FP operation 1    FP op. 2          Int. op/branch    Clock
    L.D F0,0(R1)        L.D F6,-8(R1)                                                             1
    L.D F10,-16(R1)     L.D F14,-24(R1)                                                           2
    L.D F18,-32(R1)     L.D F22,-40(R1)     ADD.D F4,F0,F2    ADD.D F8,F6,F2                      3
    L.D F26,-48(R1)                         ADD.D F12,F10,F2  ADD.D F16,F14,F2                    4
                                            ADD.D F20,F18,F2  ADD.D F24,F22,F2                    5
    S.D 0(R1),F4        S.D -8(R1),F8       ADD.D F28,F26,F2                                      6
    S.D -16(R1),F12     S.D -24(R1),F16                                                           7
    S.D -32(R1),F20     S.D -40(R1),F24                                       DSUBUI R1,R1,#56    8
    S.D 8(R1),F28                                                             BNEZ R1,LOOP        9

Unrolled 7 times to avoid delays: 7 iterations in 9 clocks, or 1.3 clocks per iteration (1.8X)
Average: 2.5 ops per clock, 50% efficiency
Note: VLIW needs more registers (15 FP registers in this example)
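A minimal sketch of the five-slot instruction word used in the table above; the field widths and names are illustrative, not from a real ISA. Each slot feeds one function unit, and a slot with no useful work is filled with a NOP encoding.

    #include <stdint.h>

    #define VLIW_NOP 0u   /* illustrative NOP encoding */

    typedef struct {
        uint32_t mem1, mem2;    /* two memory-reference slots */
        uint32_t fp1, fp2;      /* two FP-operation slots     */
        uint32_t int_branch;    /* one integer/branch slot    */
    } VliwWord;                 /* issued as a unit; no inter-slot checks */

    /* In the table above, 23 of the 9 x 5 = 45 slots hold real
     * operations, which is where the ~50% efficiency comes from. */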


Problems with First-Generation VLIW

Increase in code size
- Wasted issue slots are translated into no-ops in the instruction encoding
  - About 50% of instructions are no-ops
  - The increase is from 5 instructions to 45 instruction slots!
Operated in lock-step; no hazard detection HW
- Any function unit stalls => the entire processor stalls
- The compiler can schedule around function-unit stalls, but what about cache misses?
Binary code compatibility
- Re-compile if the number of FUs or any FU latency changes; not a major problem today

Scheduling Across Branches

Local scheduling, or basic-block scheduling
- Typically in a range of 5 to 20 instructions
- Unrolling may increase basic-block size to facilitate scheduling
- However, what happens if branches exist in the loop body?
Global scheduling: moving instructions across branches (i.e., across basic blocks)
- We cannot change the data flow seen on any branch outcome
- Compilers are not allowed to speculate!
- How to guarantee correctness?
- Increase the scheduling scope: trace scheduling, superblock scheduling, predicated execution, etc.


EPIC/IA-64: Motivation in 1989

"First, it was quite evident from Moore's law that it would soon be possible to fit an entire, highly parallel, ILP processor on a chip. Second, we believed that the ever-increasing complexity of superscalar processors would have a negative impact upon their clock rate, eventually leading to a leveling off of the rate of increase in performance." - Schlansker and Rau, Feb. 2000

Obvious today: think about the complexity of the P4, 21264, and other processors; processor complexity has been discussed in many papers since the mid-1990s
Agarwal et al., "Clock rate versus IPC: The end of the road for conventional microarchitectures," ISCA 2000

EPIC, IA-64, and Itanium

EPIC: Explicitly Parallel Instruction Computing, an architecture framework proposed by HP
IA-64: an architecture that HP and Intel developed under the EPIC framework
Itanium: the first commercial processor that implements the IA-64 architecture; now Itanium 2

EPIC Main Ideas

Compiler does the scheduling
- Permitting the compiler to play the statistics (profiling)
Hardware supports speculation
- Addressing the branch problem: predicated execution and many other techniques
- Addressing the memory problem: cache specifiers, prefetching, speculation on memory aliases

Example: IF-conversion

Example of Itanium code:

    cmp.eq p1, p2 = r1, r2;;
    (p1) sub r9 = r10, r11
    (p2) add r5 = r6, r7

Use of predicated execution to perform if-conversion: eliminate the branch and produce just one basic block containing operations guarded by the appropriate predicates. (Schlansker and Rau, 2000)
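The same if-conversion at the C level: both sides execute unconditionally and the predicates select the results, so the basic block contains no branch. A sketch, with variable names mirroring the Itanium registers above (a C compiler would typically lower the selects to cmov-style instructions):

    /* if (r1 == r2) r9 = r10 - r11; else r5 = r6 + r7;
     * rewritten branch-free with predicates p1/p2.        */
    void if_converted(int r1, int r2, int *r9, int r10, int r11,
                      int *r5, int r6, int r7)
    {
        int p1  = (r1 == r2);     /* cmp.eq p1, p2 = r1, r2 */
        int p2  = !p1;
        int sub = r10 - r11;      /* both sides computed     */
        int add = r6 + r7;
        *r9 = p1 ? sub : *r9;     /* (p1) sub r9 = r10, r11 */
        *r5 = p2 ? add : *r5;     /* (p2) add r5 = r6, r7   */
    }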

Memory Issues

Cache specifiers: the compiler indicates the cache location in loads/stores (using analytical models or profiling to find the answers?)
The compiler may actively remove data from the cache, or put data with poor locality into a special cache, reducing cache pollution
The compiler can speculate that memory aliases do not exist, so it can reorder loads and stores
- Hardware detects any violations
- The compiler then fixes up

Itanium: First Implementation by Intel

Itanium™ is the name of the first implementation of IA-64 (2001)
- Highly parallel and deeply pipelined hardware: 6-issue, 10-stage pipeline at 800MHz on a 0.18µ process
- 128 64-bit integer registers + 128 82-bit floating-point registers
- Hardware checks dependencies
- Predicated execution
- 128-bit bundle: 5-bit template + 3 41-bit instructions; two bundles can be issued together in Itanium (sketched below)
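A hedged sketch of the bundle layout just described: 5 template bits plus three 41-bit slots packed into 128 bits (5 + 3 x 41 = 128). The bit positions below follow only that arithmetic; the real slot encodings are defined by the IA-64 manuals and are not modeled here.

    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } Bundle;   /* 128 raw bits  */

    #define SLOT_MASK ((1ULL << 41) - 1)          /* 41-bit slots  */

    static Bundle pack_bundle(uint8_t tmpl, uint64_t s0, uint64_t s1,
                              uint64_t s2)
    {
        Bundle b;
        s0 &= SLOT_MASK; s1 &= SLOT_MASK; s2 &= SLOT_MASK;
        /* template: bits 0-4, slot0: 5-45, slot1: 46-86, slot2: 87-127 */
        b.lo = (tmpl & 0x1f) | (s0 << 5) | (s1 << 46);
        b.hi = (s1 >> 18) | (s2 << 23);
        return b;
    }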

Itanium™ Processor Silicon (Copyright: Intel at Hotchips '00)

[Die photo: IA-32 control, FPU, IA-64 control, TLB, integer units, instruction fetch & decode, and caches; the core processor die is 25M transistors, plus 4 x 75M transistors for the 4 x 1MB L3 cache.]

Compare with P4: 42M transistors (PIII: 26M), 217 mm2 die (PIII: 106 mm2), a trace cache buffering 12,000 micro-ops, 8KB L1 data cache, 256KB L2$

Comments on Itanium

Remarkably, the Itanium has many of the features more commonly associated with dynamically scheduled pipelines
Performance: 800MHz Itanium vs. 1GHz 21264 vs. 2GHz P4
- SPEC Int: 85% of the 21264, 60% of the P4
- SPEC FP: 108% of the P4, 120% of the 21264
- Power consumption: 178% of the P4 (watts per FP op)
It is surprising that an approach whose goal is to rely on compiler technology and simpler HW turns out to be at least as complex as dynamically scheduled processors!

