
CS/ECE 752 Chapter 3
Instruction Level Parallelism and its Dynamic Exploitation

Instructor: Prof. Wood
Computer Sciences Department, University of Wisconsin

Slides developed by Profs. Falsafi, Hill, Smith, Sohi, Vijaykumar, and Wood of
Carnegie Mellon University, Purdue University, and University of Wisconsin.
Copyright © 2002 Falsafi, Hill, Sohi, Smith, Vijaykumar, and Wood.


Dynamically Scheduled Processors

• The dynamic scheduling (superscalar) mindset
• Control and data dependences
• Progression of performance increase
• Multiple instruction issue
• Basic out-of-order execution
• Precise interrupt maintenance
• Speculative execution
• Branch prediction
• Register renaming
• Case studies


The Problem and Mindset

• Start with a static program
  ❒ program written assuming sequential execution
• Sequence through the static program to obtain the dynamic sequence of operations
  ❒ operations operate on values (present in storage locations) and create new values
• Develop a schedule to execute operations in the dynamic sequence
• Operations constitute an instruction window
  ❒ the instruction window is a portion of the dynamic instruction stream


The Problem and Mindset, contd.

• Implement the execution schedule
• Carry out the above at the desired speed
  ❒ use overlap techniques to improve speed
• Maintain the appearance of sequential (serial) execution


The Big Picture

Program forms (left) are transformed by processing phases (indented):

  static program
      | instruction fetch & branch prediction
  dynamic instruction stream
      | dependence checking & dispatch
  execution window
      | instruction issue, instruction execution
  completed instructions
      | instruction reorder & commit


Dependences

• Determine the order of operations in an execution schedule
• Two types: control dependences and data dependences


Control Dependences

• True control dependences are caused by the flow of control through the program
• Artificial control dependences are caused by the sequential nature of writing code: the PC determines the order in which instructions are executed
• Must be overcome to do multiple issue or out-of-order issue


Control Dependence Example

    for (i = 0; i < last; i++) {
        if (a[i] > a[i + 1]) {
            temp = a[i];
            a[i] = a[i + 1];
            a[i + 1] = temp;
            change++;
        }
    }

    L2: move r3, r7        # &a[i]
        lw   r8, (r3)      # load a[i]
        add  r3, r3, 4     # &a[i + 1]
        lw   r9, (r3)      # load a[i + 1]
        ble  r8, r9, L3    # inverted test: skip if !(a[i] > a[i + 1])
        move r3, r7        # &a[i]
        sw   r9, (r3)      # store a[i]
        add  r3, r3, 4     # &a[i + 1]
        sw   r8, (r3)      # store a[i + 1]
        add  r5, r5, 1     # change++
    L3: add  r6, r6, 1     # i++
        add  r7, r7, 4     # &a[i]
        blt  r6, r4, L2    # i < last


Data Dependences

• Read after Write (RAW): true dependence
• Write after Read (WAR): anti-dependence
• Write after Write (WAW): output dependence
• Only RAW dependences need to be observed
• WAR and WAW are an artifact of the storage names used
  ❒ also called name dependences
  ❒ can be overcome with storage renaming


Data Dependence Example

    L2: move r3, r7
        lw   r8, (r3)
        add  r3, r3, 4
        lw   r9, (r3)
        ble  r8, r9, L3
        move r3, r7      # WAW: rewrites r3, last written by the add above
        sw   r9, (r3)    # RAW: reads the r3 just written by the move
        add  r3, r3, 4   # WAR: overwrites r3 after the sw above has read it
        sw   r8, (r3)
        add  r5, r5, 1
    L3: add  r6, r6, 1
        add  r7, r7, 4
        blt  r6, r4, L2
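The renaming idea can be made concrete with a small sketch. The C program below is an illustration invented for these notes, not part of the original slides (the map-table scheme is the standard one, but all names such as rename_insn and NUM_ARCH are made up): every destination gets a fresh physical register and every source is read through the map, so WAW and WAR dependences vanish while RAW chains survive.

    #include <stdio.h>

    #define NUM_ARCH 32                 /* architectural registers      */

    typedef struct { int dst, src1, src2; } Insn;

    static int map[NUM_ARCH];           /* arch reg -> physical reg     */
    static int next_phys = NUM_ARCH;    /* next free physical register  */

    /* Rename one instruction in program order: sources read the current
     * mapping (RAW preserved); the destination gets a brand-new physical
     * register, so no later instruction can conflict on the old name
     * (WAR and WAW eliminated). */
    static void rename_insn(Insn *in) {
        in->src1 = map[in->src1];
        in->src2 = map[in->src2];
        map[in->dst] = next_phys++;
        in->dst  = map[in->dst];
    }

    int main(void) {
        for (int r = 0; r < NUM_ARCH; r++) map[r] = r;  /* identity map */

        /* First three instructions of the example; immediates are
         * dropped and r0 used as a dummy second source for simplicity. */
        Insn prog[] = {
            { 3, 7, 0 },   /* move r3, r7     (first write of r3)  */
            { 8, 3, 0 },   /* lw   r8, (r3)   (read of r3)         */
            { 3, 3, 0 },   /* add  r3, r3, 4  (second write of r3) */
        };
        for (int i = 0; i < 3; i++) {
            rename_insn(&prog[i]);
            printf("p%d <- p%d, p%d\n", prog[i].dst, prog[i].src1, prog[i].src2);
        }
        return 0;
    }

Running it shows the two writes of r3 landing in different physical registers (p32 and p34) while the load still reads p32: the WAW and WAR dependences on r3 are gone and only the true dependence remains.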
Converting Control to Data Dependences

• Associate a condition operand (guard) with each instruction
• Semantics of the instruction: regular semantics if the guard is TRUE, a NOP if the guard is FALSE
• Creates a data dependence between the instruction setting the guard and the instructions using it
• Called if-conversion


Example

    L2: move   r3, r7          # &a[i]
        lw     r8, (r3)        # load a[i]
        add    r3, r3, 4       # &a[i + 1]
        lw     r9, (r3)        # load a[i + 1]
        set_gt r1, r8, r9      # r1 = (a[i] > a[i + 1])
        move_c r1, r3, r7      # &a[i]
        sw_c   r1, r9, (r3)    # store a[i]
        add_c  r1, r3, r3, 4   # &a[i + 1]
        sw_c   r1, r8, (r3)    # store a[i + 1]
        add_c  r1, r5, r5, 1   # change++
    L3: add    r6, r6, 1       # i++
        add    r7, r7, 4       # &a[i]
        blt    r6, r4, L2      # i < last


Converting Data to Control Dependences

• Can be done, but not important (yet) for our purpose
• Called reverse if-conversion


Execution Schedules

• Given a set of operations to be executed, along with a dependence relationship, how do we develop an "execution schedule"?
• Dependence relationships and operation latencies determine the shape of the schedule
• A "better" schedule is one that takes fewer time steps (i.e., is more parallel)
• Reducing operation latencies allows creation of better schedules
• Relaxing dependence constraints allows creation of better schedules


Execution Schedule Examples

• Serial schedule
• Pipelined schedule
• Parallel schedule
• Parallel pipelined schedule
• Out-of-order schedule


Progression of Performance Increase

• Serial processing
  ❒ one operation at a time
• Pipelining to overlap operation execution
  ❒ initiate one operation at a time, but overlap processing of multiple operations
• Pipelining with out-of-order execution of instructions
  ❒ allow operations to execute in an order different from the one in which they were initiated
  ❒ target artificial control and data dependences


Progression of Performance Increase, contd.

• Parallel pipelines, i.e., multiple instruction issue
  ❒ initiate multiple operations at a time, and overlap processing of multiple operations
  ❒ target artificial control dependences
  ❒ the "original" superscalar
• Parallel pipelines with out-of-order execution of instructions
  ❒ target artificial data and control dependences
• Give the scheduler more operations to choose from
  ❒ use speculation to "overcome true dependences"
  ❒ the modern OOO superscalar processor


In-Order Execution

Code fragment:
    divf  f0, f2, f4
    addf  f10, f0, f8
    multf f7, f8, f14

Problem:
  ❍ addf stalls due to a RAW hazard on f0
  ❍ multf stalls because addf stalls

            1    2    3    4    5    6    7    8    9    10
    divf    IF   ID   E/   E/   E/   E/   MEM  WB
    addf         IF   ID   --   --   --   E+   MEM  WB
    multf             IF   --   --   --   ID   EX   MEM  WB

In-order execution limits performance!
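Before moving to the solutions, it may help to quantify schedule length. The sketch below is invented for these notes: it models only execute latencies (4 cycles for divf, 1 for addf, 2 for multf, as in the diagram above), ignores the fetch/decode/memory stages, and assumes enough functional units, so the absolute cycle counts differ from the pipeline diagrams; what it shows is the gap between an in-order schedule and one constrained only by RAW dependences.

    #include <stdio.h>

    #define N 3

    static const char *name[N] = { "divf", "addf", "multf" };
    static const int   lat[N]  = { 4, 1, 2 };   /* execute cycles        */
    static const int   dep[N]  = { -1, 0, -1 }; /* RAW producer, or -1   */
                                                /* (addf needs divf's f0) */

    /* Earliest start/finish for each instruction, either forced to
     * start in program order or constrained only by dataflow. */
    static int schedule(int in_order) {
        int finish[N], prev_start = 0, makespan = 0;
        for (int i = 0; i < N; i++) {
            int start = 1;                          /* cycles start at 1 */
            if (dep[i] >= 0 && finish[dep[i]] + 1 > start)
                start = finish[dep[i]] + 1;         /* wait for operand  */
            if (in_order && prev_start + 1 > start)
                start = prev_start + 1;             /* issue in order    */
            finish[i] = start + lat[i] - 1;
            prev_start = start;
            if (finish[i] > makespan) makespan = finish[i];
            printf("  %-5s start %d, finish %d\n", name[i], start, finish[i]);
        }
        return makespan;
    }

    int main(void) {
        printf("in-order:\n");
        printf("total %d cycles\n", schedule(1));
        printf("dataflow-only (out of order):\n");
        printf("total %d cycles\n", schedule(0));
        return 0;
    }

Under these assumptions the in-order schedule takes 7 cycles and the dataflow-only schedule takes 5, because multf no longer waits behind the stalled addf.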
Instruction Re-ordering (Scheduling)

• Eliminate false control dependences
  ❒ change the order of instruction execution (statically or dynamically)
• Eliminate false data (storage) dependences
  ❒ storage renaming
• The two problems/solutions go hand in hand


Dynamic vs. Static Scheduling

Solutions
  ❒ dynamic vs. static scheduling

Static scheduling (software)
  ❍ compiler reorganizes instructions
  ❍ simpler hardware
  ❍ can use more powerful algorithms
  ❍ Itanium, Crusoe, lots of DSP chips


Static Scheduling

Reorder the code fragment:
    divf  f0, f2, f4
    multf f7, f8, f14
    addf  f10, f0, f8

Solution:
  ❍ multf is independent of addf and divf, so reorder
  ❍ addf still stalls

            1    2    3    4    5    6    7    8    9
    divf    IF   ID   E/   E/   E/   E/   MEM  WB
    multf        IF   ID   E*   E*   MEM  WB
    addf              IF   ID   --   --   EX+  MEM  WB

Hides the execution of multf.


Dynamic Scheduling

Dynamic scheduling (hardware)
  ❍ handles dependences unknown at compile time
  ❍ hardware reorganizes instructions!
  ❍ more complex hardware, but code is more portable
  ❍ "the real stuff"

• Shorter (more parallel) execution schedule while preserving the illusion of sequential execution


Conceptual Model

Dynamically build the dataflow graph
  ❒ respect RAW
  ❒ respect or avoid WAW, WAR

Execute when operands are ready
  ❒ exploit the parallelism of independent instructions

(figure: dataflow graph of the example)


Issues to Remember

• Dynamic scheduling issues
  ❒ data dependences: RAW (in registers, memory, condition codes, etc.)
  ❒ name dependences: WAR and WAW
  ❒ traps, interrupts, and exceptions
  ❒ control dependences: branches and jumps
• Dynamic scheduling algorithms
  ❒ scoreboarding
  ❒ Tomasulo's algorithm
  ❒ RUU
  ❒ R10K, et al.


Dynamic Scheduling: Scoreboard

Centralized control scheme
  ❍ controls all instruction issue
  ❍ detects all hazards

Implemented in the CDC 6600 in 1964
  ❍ the CDC 6600 has 18 separate functional units (not pipelined)
  ❍ 4 FP: 2 multiply, 1 add, 1 divide
  ❍ 7 memory units


Datapath with Scoreboard
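The datapath figure is not reproduced here, but the state it carries is compact. The following sketch is an illustration for these notes, not the CDC 6600's actual logic: the per-unit fields (Fi, Fj, Fk, Qj, Qk, Rj, Rk) follow the classic textbook description of the scoreboard, and the two checks shown are the issue-stage test (structural and WAW hazards) and the read-operands test (RAW).

    #include <stdio.h>
    #include <stdbool.h>

    #define NUM_REGS 32
    #define NUM_FUS   4

    /* One functional-unit status entry, after the classic description
     * of the CDC 6600 scoreboard. */
    typedef struct {
        bool busy;        /* unit in use?                       */
        int  op;          /* operation being performed          */
        int  Fi;          /* destination register               */
        int  Fj, Fk;      /* source registers                   */
        int  Qj, Qk;      /* units producing Fj/Fk, or -1       */
        bool Rj, Rk;      /* are the source operands ready?     */
    } FUStatus;

    static FUStatus fu[NUM_FUS];
    static int result[NUM_REGS];  /* unit that will write each reg, -1 if none */

    static void init(void) {
        for (int r = 0; r < NUM_REGS; r++) result[r] = -1;
    }

    /* Issue stage: stall on a structural hazard (unit busy) or a WAW
     * hazard (some unit is already going to write this destination). */
    static bool issue(int u, int op, int fi, int fj, int fk) {
        if (fu[u].busy || result[fi] != -1)
            return false;
        fu[u] = (FUStatus){ .busy = true, .op = op, .Fi = fi,
                            .Fj = fj, .Fk = fk,
                            .Qj = result[fj], .Qk = result[fk],
                            .Rj = (result[fj] == -1),
                            .Rk = (result[fk] == -1) };
        result[fi] = u;               /* claim the destination register */
        return true;
    }

    /* Read-operands stage: proceed only when both sources are ready,
     * i.e., there is no outstanding RAW hazard. */
    static bool can_read_operands(int u) {
        return fu[u].busy && fu[u].Rj && fu[u].Rk;
    }

    int main(void) {
        init();
        issue(0, 0,  0, 2, 4);    /* divf f0, f2, f4   on unit 0 */
        issue(1, 1, 10, 0, 8);    /* addf f10, f0, f8  on unit 1 */
        /* addf must wait: its Fj (f0) is still being produced by unit 0 */
        printf("addf can read operands? %s\n",
               can_read_operands(1) ? "yes" : "no");
        return 0;
    }

A full scoreboard also performs a WAR check before a unit writes its result back (it must wait until earlier instructions have read the old value of the destination); that step, and the completion bookkeeping that clears Qj/Qk in waiting units, are omitted from this sketch.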