United States Patent [19J [11] Patent Number: 6,112,019 Chamdani Et Al
Total Page:16
File Type:pdf, Size:1020Kb
Illlll llllllll Ill lllll lllll lllll lllll lllll 111111111111111111111111111111111 US006112019A United States Patent [19J [11] Patent Number: 6,112,019 Chamdani et al. [45] Date of Patent: Aug. 29, 2000 [54] DISTRIBUTED INSTRUCTION QUEUE "An Efficient Algorithm for Exploring Multiple Arithmetic Units," Tomasulo IBM Journal, Jan. 1967, pp. 25-33. [75] Inventors: Joseph I. Chamdani, Marietta; Cecil 0. Alford, Lawrenceville, both of Ga. "Implementation of Precise Interrupts In Pipelined Proces sors," James E. Smith Andrew R. Pleszkun,© 1985 IEEE, [73] Assignee: Georgia Tech Research Corp., Atlanta, pp. 36-44. Ga. "Instruction Issue Logic in Pipelined Supercomputers", [21] Appl. No.: 08/489,509 Shlomo Weiss & James E. Smith,© 1984 IEEE, Transac [22] Filed: Jun. 12, 1995 tions on Computers, vol. c-33, No. 11, pp. 1012-1022 Nov. 1984. [51] Int. CI.7 ........................................................ G06F 9/22 [52] U.S. Cl. .............................................................. 395/390 "The Metafiow Architecture", Popescu et al, IEEE Micro,© [58] Field of Search ..................................... 395/376, 378, 1991 IEEE, pp. 10-13, 63-73. 395/381, 382, 391, 393, 390 [56] References Cited Primary Examiner-David Y. Eng Attorney, Agent, or Firm-Thomas, Kayden, Horstemeyer U.S. PATENT DOCUMENTS & Risley, L.L.P. 3,924,245 12/1975 Eaton et al. ....................... 395/421.09 4,725,947 2/1988 Shonai et al. ........................... 395/585 [57] ABSTRACT 4,736,288 4/1988 Shintani et al. .. ... ... ... .... ... ... ... 395 /395 4,752,873 6/1988 Shonai et al. ...................... 395/800.23 A distributed instruction queue (DIQ) in a superscalar 4,780,810 10/1988 Torii et al. .... ... ... ... ... .... ... ... ... 395 /800 microprocessor supports multi-instruction issue, decoupled 4,896,258 1/1990 Yamaguchi et al. .................... 395/583 data flow scheduling, out-of-order execution, register 4,992,938 2/1991 Cocke et al. ............................ 395/393 renaming, multi-level speculative execution, and precise 5,050,067 9/1991 McLogan et al. ...................... 395/678 5,122,984 6/1992 Strehler ..................................... 365/49 interrupts. The DIQ provides distributed instruction shelving 5,129,067 7/1992 Johnson .................................. 395/375 without storing register values, operand value copying, and 5,136,697 8/1992 Johnson .................................. 395/375 result value forwarding, and supports in-order issue as well 5,208,914 5/1993 Wilson et al. .......................... 395/275 as out-of-order issue within its functional unit. The DIQ 5,261,066 11/1993 Jouppi et al. ........................... 395/425 5,345,569 9/1994 Tran ........................................ 395/393 allows a reduction in the number of global wires and 5,355,457 10/1994 Shebanow et al. ..................... 395/394 replacement with private-local wires in the processor. The 5,367,703 11/1994 Levitan . ... ... ... .... ... ... ... ... ... .... .. 395 /800 DIQ's number of global wires remains the same as the 5,371,684 12/1994 Iadonato et al. ........................ 364/491 number of DIQ entries and data size increases. The DIQ 5,414,822 5/1995 Saito et al. .............................. 395/375 maintains maximum machine parallelism and the actual 5,619,730 4/1997 Ando ....................................... 395/855 performance of the microprocessor using the DIQ is better OIBER PUBLICATIONS due to reduced cycle time or more operations executed per Sohi, "Instruction Issue Logic for High-Performance, Inter cycle. ruptible, Multiple Function Unit, Pipelined Computers", 1990 IEEE, pp. 349-359. 22 Claims, 40 Drawing Sheets (from dispatch buses) (to execution unit) 300 340 DIQ alloc_port 310 bottom_DIQ_entry (O .. Nd-1) (issued_DIQ_entry) / 365 mispred_flag 350 Tail Pointer Logic 360 391 Note: insLID ;:;unique instruction tag opcode =opcode of the instruction RS1 =register number of first source operand RS1 _tag =register tag of first source operand RS2 =register number of second source operand RS2_tag ""register tag of second source operand In-order Issue Distributed Instruction Queue(DIQ) •d \JJ. • Fetch Decode Dispatch Issue Execute Writeback Retire ~ ~ Execution Unit results to ...... Result ~ instructions Register Decoder Central ~ • Buffer ...... from Window • [[ File = I-cache __J • w Execution Unit load Store store to Load/Store Unit > Buffer D-cache ~= N ~~ load from D-cache cN c c (a) with a Central Instruction Window Fetch Decode Dispatch Issue Execute Write back Retire 'Jl ~=- ~ Dist. Window Execution Unit results to ..... Result • instructions Register • • Buffer • '"""'0....., from Decoder File • K: • • .i;;.. I-cache • • c Dist. Window I •I Execution Unit Store store to Load/Store Unit r ~1 Dist. Window --- Buffer D-cache ....0--, load from D-cache ~ ~ Fig. 1 ....N (a) with Distributed Instruction Windows =~ \C •d \JJ. • ~ ......~ ~ =...... Note: inst 1 and inst 2 can be a floating-point arithmetic and/or a load/store instruction ~ inst 1 inst 2 ~ N Free List (FL) ~~ S2IS3 OPITIS1 IS2IS3 cN I I I I I c c Mapping Table 33 I 34 I 35 I 36 I 371 38 I 39 32X6 3 'Jl Pending-Target ~= Return Queue (PTRQ) .....~ N 0....., Register Mapping Table in IBM RS/6000 Floating-Point Unit .i;;.. c Fig. 2 ....0--, ~ ~ ....N =~ \C •d \JJ. Register File I : right_opr_bus (operand data • (In-Order State) left_opr_bus to functional units) I ~ ......~ ~ retire bus Result =...... Comparator/ Shift Bypass Network Register • • • entry number I Reorder Buffer ---- J ~ (Look-Ahead State) - result bus (result value & exception condition ~ from a functional unit) N ~~ Result Shift Register cN Reorder Buffer c functional c stage valid tag entry dest. excep- program unit source result valid number reg tions counter N 0 'Jl • • • • • • • • • • ~= • • • • • • • • • • .....~ • • • • • • • • • • ~ shift 0....., 6 tail- direction 1 4 .i;;.. 5 float add c 5 0 0 17 4 0 head- 4 4 0 16 l 3 0 3 2 integer add 1 5 1 0 ....0--, ~ Note: N =the length of longest functional-unit pipeline ~ ....N Fig. 3 Reorder Buffer Organization =~ \C •d \JJ. • tag oocode S1 a(S1) S2 a(S2) D a<D) B(D) 12 ~ top ......~ 16 fadd RO 2 R4 2 RO 2 2 8 ~ 15 fadd R4 1 R6 1 R4 1 1 4 ...... 14 fadd R6 0 R7 0 R6 0 0 0 = 13 fadd R4 0 RS 0 R4 0 0 0 12 fadd RO 1 R2 1 RO 1 1 4 11 fadd R2 0 R3 0 R2 0 0 0 bottom IO fadd RO 0 R1 0 RO 0 0 0 ~ (a) Before Issue ~ N ~~ cN c taa oocode S1 a(S1) S2 a(S2) D a<D) B(D) 12 c top empty spaces ready for subsequent instructions 'Jl ~= 16 fadd RO 1 R4 1 RO 1 1 4 .....~ 15 fadd R4 0 R6 0 R4 0 0 0 .i;;.. bottom T2 fadd RO 0 R? 0 RO 0 0 0 0....., c.i;;.. (b) After Issue and Completion Note: S1/S2 =first/second source register identifier, D =destination register identifier, a(X) = # of times register X is designated as a destination register in preceding inst (below it), P(X) =#of times register Xis designated as a source register in preceding instruction (below it), = issue index= a(S1) + a(S2) + a(D) + p(D). ....0--, ~ 8-Entry Dispatch Stack ~ ....N Fig. 4 =~ \C U.S. Patent Aug. 29, 2000 Sheet 5 of 40 6,112,019 issue writeback 12 15 16 retire I 15 16 (a) Instruction Timing Note: "float add" takes 6 cycles to complete after issued. Entry PC Opcode Source 0 lerand 1Source 0 )erand 2 Destination Executed Excep- Number ·eady tag content ready tag conten tag con ten tions tail - 7 6 16 fadd 0 0.2 - 0 4.2 - 0.3 - 0 - I+-alloc 5 15 fadd 0 4.1 - 0 6.1 - 4.2 - 0 - I+- not rdy 4 14 fadd 1 - R60 1 - R~ 6.1 - 0 - I+-4th issue 3 13 fadd 1 - R40 1 - R50 4.1 - 0 - r+- 3rd issue 2 12 fadd 0 0.1 - 0 2.1 - 0.2 - 0 - -+- not rdy 1 11 fadd 1 - R~ 1 - R30 2.1 - 0 - -+- 2nd issue head - 0 IO fadd 1 - ROO 1 - R10 0.1 - 0 - -+- 1st issue (b) RUU Snapshot at Cycle 7 Note: Rik means the kth instance of register Ri (register renaming) Entry PC Opcode Source Ooerand 1Source Operand 2 Destination Executed Excep- Number eady tag content ready tag conten tag conten tions tail - 7 6 16 fadd 0 0.2 . 0 4.2 - 0.3 - 0 - -+- not rdy 5 15 fadd 0 4.1 - 0 6.1 - 4.2 - 0 - +- not rdy 4 14 fadd 1 - R60 1 - R?O 6.1 - 0 - -+- in exec 3 13 fadd 1 - R40 1 - R50 4.1 - 0 - -+- in exec 1 1 2 12 fadd 1 0.1 R0 1 2.1 R2 0.2 - 0 - -+- ready 1 11 fadd 1 - R20 1 - R30 2.1 R21 1 none -+- wrtback head 0 IO fadd 1 - ROO 1 - R10 0.1 R01 1 none -+- retire - (c) RUU Snapshot at Cycle 9 Register Update Unit Fig. 5 •d \JJ. • Instruction ~ Cache ......~ ~ =...... Fetch & Decode ORIS > Register ~= Retire Write- • File (Deferred-Scheduling, Register-Renaming back • N Instruction Shelf) I • I ~~ cN c c ••• operand buses • • • Issue/Schedule operand and instruction routing 'Jl write back ~= • • • buses .....~ O'I • • • bypass 0....., FU1 FUn buses c.i;;.. functional units • • • ....0--, Metaflow Architecture ~ ~ Fig. 6 ....N =~ \C •d \JJ. • cycle 1 2- 3- 4 5- 6- 7 8- 9- 10. - 11.. allocate IO, 11, 12, 13 14, 15, I6 ~ issue IO 11 13 14 12 15 16 ......~ writeback IO 11 13 14 12 15 ~ retire- IO 11 13 14 12 13 ...... (a) Instruction Timing = Note: Assume there are 4 allocate ports, 4 retire ports, -- 2 floating-point add FUs (class rum 2) with 3-cycle latency. Source Operand 1 Source Operand 2 Destination FU Index Disp. Exec. Opcode PC Locked R num ID Locked R num ID latest Rnum content class ~ 7 ~ N 6 1 0 0.2 1 4 0.5 1 0 - 0 2 0 fadd I6 ._alloc ._alloc ~~ 5 1 4 0.3 1 6 0.4 1 4 - 0 2 0 fadd 15 N c 4 0 6 - 0 7 - 1 6 - 0 2 0 fadd 14 -alloc c c 3 0 4 - 0 5 - 0 4 - 0 2 0 fadd 13 ._ready 2 1 0 0.0 1 2 0.1 0 0 - 0 2 0 fadd 12 .__not rdy 1 0 2 - 0 3 - 1 2 - 1 2 0 fadd 11 -1st iss h 0 0 0 1 0 1 fadd -1st iss 0 - - 0 - 2 0 IO 'Jl Note: Ri'\ means the ktn instance of (b) ORIS Snapshot at Cycle 2 ~= register Ri (register renaming) .....~ Source Operand 1 Source Operand 2 Destination FU -..J Disp.