Review: Memory Disambiguation / Review: Explicit Register Renaming


CS252 Graduate Computer Architecture
Lecture 8

Instruction Level Parallelism:
Getting the CPI < 1

September 27, 2000
Prof. John Kubiatowicz
CS252/Kubiatowicz — Lec 8.1 — 9/27/00

Review: Reorder Buffer (ROB)

• Use of reorder buffer
  – In-order issue, Out-of-order execution, In-order commit
  – Holds results until they can be committed in order
    » Serves as source of values until instructions committed
  – Provides support for precise exceptions/speculation: simply throw out instructions later than the excepting instruction
  – Commits user-visible state in instruction order
  – Stores sent to memory system only when they reach head of buffer
• In-Order Commit is important because:
  – Allows the generation of precise exceptions
  – Allows speculation across branches
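The in-order commit discipline in the ROB bullets can be modeled as a queue in a few lines. The following Python sketch is illustrative only (class and field names are invented, not from the lecture): instructions enter at the tail in issue order, may complete in any order, but retire strictly from the head.

```python
from collections import deque

class ReorderBuffer:
    """Toy ROB: in-order issue, out-of-order completion, in-order commit."""
    def __init__(self):
        self.entries = deque()          # head = oldest in-flight instruction

    def issue(self, name):
        entry = {"name": name, "done": False, "result": None}
        self.entries.append(entry)      # in-order issue at the tail
        return entry

    def complete(self, entry, result): # may be called in any order
        entry["done"] = True
        entry["result"] = result

    def commit(self):
        """Retire completed instructions strictly from the head."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired

rob = ReorderBuffer()
a = rob.issue("LD F0,0(R1)")
b = rob.issue("ADDD F4,F0,F2")
rob.complete(b, 42)        # ADDD finishes first (out of order)...
assert rob.commit() == []  # ...but cannot commit past the unfinished LD
rob.complete(a, 7)
assert rob.commit() == ["LD F0,0(R1)", "ADDD F4,F0,F2"]  # in-order commit
```

Because `commit` stops at the first unfinished entry, an exception can be made precise by discarding everything younger than the faulting entry, exactly as the slide notes.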

Review: Memory Disambiguation

• Question: Given a load that follows a store in program order, are the two related?
  – Trying to detect RAW hazards through memory
  – Stores commit in order (ROB), so no WAR/WAW memory hazards.
• Implementation
  – Keep queue of stores, in program order
  – Watch for position of new loads relative to existing stores
• When have address for load, check store queue:
  – If any store prior to load is waiting for its address, stall load.
  – If load address matches earlier store address (associative lookup), then we have a memory-induced RAW hazard:
    » store value available → return value
    » store value not available → return ROB number of source
  – Otherwise, send out request to memory

Review: Explicit Register Renaming

• Make use of a physical register file that is larger than number of registers specified by ISA
• Key insight: Allocate a new physical destination register for every instruction that writes
  – Removes all chance of WAR or WAW hazards
  – Similar to compiler transformation called Static Single Assignment
    » Like hardware-based dynamic compilation?
• Mechanism?  Keep a translation table:
  – ISA register → physical register mapping
  – When register written, replace entry with new register from freelist.
  – Physical register free when not used by any active instructions
• Advantages of explicit renaming:
  – Decouples renaming from scheduling
  – Allows data to be fetched from single register file
    » No need to bypass values from reorder buffer
    » Better utilization of register ports
• Will relax exact dependency checking in later lecture
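The translation-table mechanism can be sketched directly. This Python model is a hypothetical illustration (the register numbering and free-list contents are made up): every instruction that writes allocates a fresh physical destination from the free list and updates the map, so two writes to the same ISA register never collide and WAW/WAR hazards on the ISA name disappear.

```python
def rename(instructions, num_isa_regs, free_list):
    """instructions: list of (dest, src1, src2) ISA register numbers.
    Returns the same stream expressed in physical register numbers."""
    # Initial mapping: ISA register i lives in physical register i.
    map_table = {r: r for r in range(num_isa_regs)}
    out = []
    for dest, src1, src2 in instructions:
        p1, p2 = map_table[src1], map_table[src2]  # read current mappings
        pd = free_list.pop(0)                      # fresh physical destination
        map_table[dest] = pd                       # later readers see pd
        out.append((pd, p1, p2))
    return out

# Two writes to r1 (a WAW hazard on the ISA name) get distinct physical
# destinations, so they may complete in any order; the reader of r1 is
# wired to the *second* physical copy.
prog = [(1, 2, 3),   # add r1, r2, r3
        (1, 4, 5),   # add r1, r4, r5  (WAW on r1)
        (6, 1, 2)]   # add r6, r1, r2  (reads the second r1)
renamed = rename(prog, num_isa_regs=8, free_list=[32, 33, 34])
assert renamed == [(32, 2, 3), (33, 4, 5), (34, 33, 2)]
```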

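The store-queue check from the memory-disambiguation review can likewise be sketched. Assumptions in this Python model: the queue holds only stores older than the load, oldest first; `addr` of None means that store's address is not yet computed; and the return tags ('stall', 'forward', 'wait_rob', 'memory') are invented names for the slide's four outcomes.

```python
def check_load(load_addr, store_queue):
    """store_queue: stores older than the load, oldest first; each entry has
    'addr' (None while still being computed) and 'value' (None if not ready)."""
    # Rule 1: if any older store is still waiting for its address, stall.
    if any(st['addr'] is None for st in store_queue):
        return ('stall',)
    # Rule 2: associative lookup; the youngest matching older store wins.
    for idx in range(len(store_queue) - 1, -1, -1):
        st = store_queue[idx]
        if st['addr'] == load_addr:            # memory-induced RAW hazard
            if st['value'] is not None:
                return ('forward', st['value'])  # forward the store's value
            return ('wait_rob', idx)             # wait on the producing entry
    # Rule 3: no conflict, send the request to memory.
    return ('memory',)

sq = [{'addr': 100, 'value': 7},
      {'addr': 200, 'value': None}]
assert check_load(200, sq) == ('wait_rob', 1)
assert check_load(100, sq) == ('forward', 7)
assert check_load(300, sq) == ('memory',)
sq.append({'addr': None, 'value': None})   # an older store, address unresolved
assert check_load(100, sq) == ('stall',)
```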

Instruction Level Parallelism

• High-speed execution based on instruction-level parallelism (ILP): potential of short instruction sequences to execute in parallel
• High-speed microprocessors exploit ILP by:
  1) Pipelined execution: overlap instructions
  2) Out-of-order execution (commit in-order)
  3) Multiple issue: issue and execute multiple instructions per clock cycle
  4) Vector instructions: many independent ops specified with a single instruction
• Memory accesses for high-speed microprocessor?
  – Data Cache possibly multiported, multiple levels
• Anticipated success led to use of Instructions Per Clock cycle (IPC) vs. CPI

Getting CPI < 1: Issuing Multiple Instructions/Cycle

• Superscalar: varying no. instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo)
  – IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000
• (Very) Long Instruction Words (V)LIW: fixed number of instructions (4-16) scheduled by the compiler; put ops into wide templates
  – Joint HP/Intel agreement in 1999/2000?
  – Intel Architecture-64 (IA-64): 64-bit address
    » Style: "Explicitly Parallel Instruction Computer (EPIC)"
  – New SUN MAJC Architecture: VLIW for Java
• Vector Processing: explicit coding of independent loops as operations on large vectors of numbers
  – Multimedia instructions being added to many processors


Review: Unrolled Loop that Minimizes Stalls for Scalar

  1  Loop: LD   F0,0(R1)
  2        LD   F6,-8(R1)
  3        LD   F10,-16(R1)
  4        LD   F14,-24(R1)
  5        ADDD F4,F0,F2
  6        ADDD F8,F6,F2
  7        ADDD F12,F10,F2
  8        ADDD F16,F14,F2
  9        SD   0(R1),F4
  10       SD   -8(R1),F8
  11       SD   -16(R1),F12
  12       SUBI R1,R1,#32
  13       BNEZ R1,LOOP
  14       SD   8(R1),F16   ; 8-32 = -24

Latencies assumed: LD to ADDD: 1 cycle; ADDD to SD: 2 cycles.
14 clock cycles, or 3.5 per iteration.

Getting CPI < 1: Issuing Multiple Instructions/Cycle

• Superscalar DLX: 2 instructions, 1 FP & 1 anything else
  – Fetch 64 bits/clock cycle; Int on left, FP on right
  – Can only issue 2nd instruction if 1st instruction issues
  – More ports for FP registers to do FP load & FP op in a pair

  Type              Pipe Stages
  Int. instruction  IF  ID  EX  MEM WB
  FP instruction    IF  ID  EX  MEM WB
  Int. instruction      IF  ID  EX  MEM WB
  FP instruction        IF  ID  EX  MEM WB
  Int. instruction          IF  ID  EX  MEM WB
  FP instruction            IF  ID  EX  MEM WB

• 1-cycle load delay expands to 3 instructions in SS
  – instruction in right half can't use it, nor instructions in next slot

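Behaviorally, the unrolled loop above is the transformation below, shown in Python rather than DLX purely as a sketch: four copies of the body share one index decrement and branch, which removes most loop overhead and exposes four independent loads and adds for the scheduler to interleave.

```python
def add_scalar(x, s):
    """Rolled loop: one add per element, loop overhead on every trip."""
    y = list(x)
    i = len(y) - 1
    while i >= 0:
        y[i] = y[i] + s          # LD / ADDD / SD
        i -= 1                   # SUBI / BNEZ
    return y

def add_scalar_unrolled4(x, s):
    """Unrolled by 4 (length assumed to be a multiple of 4, as on the slide)."""
    y = list(x)
    i = len(y) - 1
    while i >= 0:
        y[i]     = y[i]     + s  # four independent bodies: their loads and
        y[i - 1] = y[i - 1] + s  # adds can be reordered to hide the LD->ADDD
        y[i - 2] = y[i - 2] + s  # and ADDD->SD latencies
        y[i - 3] = y[i - 3] + s
        i -= 4                   # one SUBI/BNEZ per 4 elements
    return y

data = [float(v) for v in range(16)]
assert add_scalar_unrolled4(data, 2.0) == add_scalar(data, 2.0)
```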

Loop Unrolling in Superscalar

         Integer instruction     FP instruction       Clock cycle
  Loop:  LD   F0,0(R1)                                1
         LD   F6,-8(R1)                               2
         LD   F10,-16(R1)        ADDD F4,F0,F2        3
         LD   F14,-24(R1)        ADDD F8,F6,F2        4
         LD   F18,-32(R1)        ADDD F12,F10,F2      5
         SD   0(R1),F4           ADDD F16,F14,F2      6
         SD   -8(R1),F8          ADDD F20,F18,F2      7
         SD   -16(R1),F12                             8
         SD   -24(R1),F16                             9
         SUBI R1,R1,#40                               10
         BNEZ R1,LOOP                                 11
         SD   -32(R1),F20                             12

• Unrolled 5 times to avoid delays (+1 due to SS)
• 12 clocks, or 2.4 clocks per iteration (1.5X)

Dynamic Scheduling in Superscalar: The Easy Way

• How to issue two instructions and keep in-order instruction issue for Tomasulo?
  – Assume 1 integer + 1 floating point
  – 1 Tomasulo control for integer, 1 for floating point
• Issue at 2X clock rate, so that issue remains in order
• Only FP loads might cause dependency between integer and FP issue:
  – Replace load reservation station with a load queue; operands must be read in the order they are fetched
  – Load checks addresses in Store Queue to avoid RAW violation
  – Store checks addresses in Load Queue to avoid WAR, WAW
  – Called "decoupled architecture": compare with Smith paper


Multiple Issue Challenges

• While Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with:
  – Exactly 50% FP operations
  – No hazards
• If more instructions issue at same time, greater difficulty of decode and issue:
  – Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue
  – Register file: need 2x reads and 1x writes/cycle
  – Rename logic: must be able to rename same register multiple times in one cycle!  For instance, consider 4-way issue:

      add r1, r2, r3        add p11, p4, p7
      sub r4, r1, r2   =>   sub p22, p11, p4
      lw  r1, 4(r4)         lw  p23, 4(p22)
      add r5, r1, r2        add p12, p23, p4

    Imagine doing this transformation in a single cycle!
  – Result buses: Need to complete multiple instructions/cycle
    » So, need multiple buses with associated matching logic at every reservation station.
    » Or, need multiple forwarding paths

VLIW: Very Large Instruction Word

• Each "instruction" has explicit coding for multiple operations
  – In EPIC, grouping called a "packet"
  – In Transmeta, grouping called a "molecule" (with "atoms" as ops)
• Tradeoff instruction space for simple decoding
  – The long instruction word has room for many operations
  – By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel
  – E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
    » 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
  – Need compiling technique that schedules across several branches
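The 4-way rename example can be reproduced with the same table-plus-free-list idea, now renaming a whole block in one pass as the rename logic must do within a single cycle. Note the assumptions: the initial mappings (r2→p4, r3→p7) and the free-list order (p11, p22, p23, p12) are reverse-engineered from the example's output, not stated in the lecture.

```python
def rename_block(instrs, map_table, free_list):
    """Rename dest/src fields of (op, dest, src1, src2) tuples in one pass."""
    out = []
    for op, dest, src1, src2 in instrs:
        s1 = map_table.get(src1, src1)   # sources read the *current* map,
        s2 = map_table.get(src2, src2)   # including writes earlier in the block
        d = free_list.pop(0)             # every write gets a fresh register
        map_table[dest] = d
        out.append((op, d, s1, s2))
    return out

block = [("add", "r1", "r2", "r3"),
         ("sub", "r4", "r1", "r2"),
         ("lw",  "r1", "r4", None),      # 4(r4): base register is r4
         ("add", "r5", "r1", "r2")]
# Assumed initial state, chosen so the output matches the slide:
renamed = rename_block(block,
                       map_table={"r2": "p4", "r3": "p7"},
                       free_list=["p11", "p22", "p23", "p12"])
assert renamed == [("add", "p11", "p4", "p7"),
                   ("sub", "p22", "p11", "p4"),
                   ("lw",  "p23", "p22", None),
                   ("add", "p12", "p23", "p4")]
```

The serial dependence through `map_table` is exactly why 4-way rename hardware needs intra-group bypassing: instruction 2 must see instruction 1's new mapping within the same cycle.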


Recall: Unrolled Loop that Minimizes Stalls for Scalar

  1  Loop: LD   F0,0(R1)
  2        LD   F6,-8(R1)
  3        LD   F10,-16(R1)
  4        LD   F14,-24(R1)
  5        ADDD F4,F0,F2
  6        ADDD F8,F6,F2
  7        ADDD F12,F10,F2
  8        ADDD F16,F14,F2
  9        SD   0(R1),F4
  10       SD   -8(R1),F8
  11       SD   -16(R1),F12
  12       SUBI R1,R1,#32
  13       BNEZ R1,LOOP
  14       SD   8(R1),F16   ; 8-32 = -24

Latencies assumed: LD to ADDD: 1 cycle; ADDD to SD: 2 cycles.
14 clock cycles, or 3.5 per iteration.

Loop Unrolling in VLIW

  Memory           Memory           FP               FP               Int. op/        Clock
  reference 1      reference 2      operation 1      op. 2            branch
  LD F0,0(R1)      LD F6,-8(R1)                                                       1
  LD F10,-16(R1)   LD F14,-24(R1)                                                     2
  LD F18,-32(R1)   LD F22,-40(R1)   ADDD F4,F0,F2    ADDD F8,F6,F2                    3
  LD F26,-48(R1)                    ADDD F12,F10,F2  ADDD F16,F14,F2                  4
                                    ADDD F20,F18,F2  ADDD F24,F22,F2                  5
  SD 0(R1),F4      SD -8(R1),F8     ADDD F28,F26,F2                                   6
  SD -16(R1),F12   SD -24(R1),F16                                                     7
  SD -32(R1),F20   SD -40(R1),F24                                     SUBI R1,R1,#48  8
  SD -0(R1),F28                                                       BNEZ R1,LOOP    9

• Unrolled 7 times to avoid delays
• 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
• Average: 2.5 ops per clock, 50% efficiency
• Note: Need more registers in VLIW (15 vs. 6 in SS)
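The throughput figures quoted across the scalar, superscalar, and VLIW schedules are easy to re-check. The short Python computation below uses only the cycle and iteration counts stated on the slides; note the slides round 2.56 down to "2.5 ops per clock" and 1.87 down to "1.8X".

```python
scalar = 14 / 4            # scalar unrolled-by-4: 14 clocks, 4 iterations
superscalar = 12 / 5       # superscalar unrolled-by-5: 12 clocks, 5 iterations
vliw = 9 / 7               # VLIW unrolled-by-7: 9 clocks, 7 iterations

assert scalar == 3.5 and superscalar == 2.4
assert round(vliw, 1) == 1.3
assert round(scalar / superscalar, 1) == 1.5   # superscalar speedup: "1.5X"
assert round(superscalar / vliw, 2) == 1.87    # VLIW speedup, quoted as "1.8X"

# Ops per clock: 7 LD + 7 ADDD + 7 SD + SUBI + BNEZ = 23 ops in 9 clocks,
# spread over 5 issue slots per long instruction word.
ops = 7 * 3 + 2
assert ops == 23
assert round(ops / 9, 2) == 2.56               # quoted as "2.5 ops per clock"
assert round(ops / (9 * 5), 2) == 0.51         # ~50% of the 45 issue slots
```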


Recall: Software Pipelining

• Observation: if iterations from loops are independent, then can get more ILP by taking instructions from different iterations
• Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (≈ Tomasulo in SW)

(Figure: iterations 0-4 overlap in time; one software-pipelined iteration draws an instruction from each in-flight iteration.)

Recall: Software Pipelining Example

Before: Unrolled 3 times

  1  LD    F0,0(R1)
  2  ADDD  F4,F0,F2
  3  SD    0(R1),F4
  4  LD    F6,-8(R1)
  5  ADDD  F8,F6,F2
  6  SD    -8(R1),F8
  7  LD    F10,-16(R1)
  8  ADDD  F12,F10,F2
  9  SD    -16(R1),F12
  10 SUBI  R1,R1,#24
  11 BNEZ  R1,LOOP

After: Software Pipelined

  1  SD    0(R1),F4    ; Stores M[i]
  2  ADDD  F4,F0,F2    ; Adds to M[i-1]
  3  LD    F0,-16(R1)  ; Loads M[i-2]
  4  SUBI  R1,R1,#8
  5  BNEZ  R1,LOOP

Symbolic Loop Unrolling:
  – Maximize result-use distance
  – Less code space than unrolling
  – Fill & drain pipe only once per loop, vs. once per each unrolled iteration in loop unrolling

(Figure: "Loop Unrolled" vs. "SW Pipelined" schedules over time.)
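The transformation can be mimicked behaviorally in Python (a sketch only; array indexing stands in for the R1-relative addressing): each trip of the kernel performs a store for iteration i, an add for iteration i+1, and a load for iteration i+2, matching the slide's 5-instruction body, with an explicit prologue to fill the pipe and an epilogue to drain it.

```python
def add_pipelined(x, s):
    """Software-pipelined y[i] = x[i] + s (kernel: SD / ADDD / LD pattern)."""
    n = len(x)
    y = [None] * n
    if n < 3:                # too short to pipeline; do it the plain way
        return [v + s for v in x]
    # Prologue: fill the pipeline (two loads, one add).
    f0 = x[0]                # LD   for iteration 0
    f4 = f0 + s              # ADDD for iteration 0
    f0 = x[1]                # LD   for iteration 1
    # Kernel: one store, one add, one load per trip, each taken from a
    # different original iteration -- the slide's reorganized loop body.
    for i in range(n - 2):
        y[i] = f4            # SD   -> finishes iteration i
        f4 = f0 + s          # ADDD -> iteration i+1
        f0 = x[i + 2]        # LD   -> iteration i+2
    # Epilogue: drain the pipeline.
    y[n - 2] = f4
    y[n - 1] = f0 + s
    return y

data = list(range(10))
assert add_pipelined(data, 5) == [v + 5 for v in data]
```

Note how the pipe is filled and drained exactly once, regardless of trip count, which is the code-space advantage over unrolling claimed on the slide.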


Software Pipelining with Loop Unrolling in VLIW

  Memory          Memory          FP               FP      Int. op/   Clock
  reference 1     reference 2     operation 1      op. 2   branch
  LD F0,-48(R1)   ST 0(R1),F4     ADDD F4,F0,F2                       1
  LD F6,-56(R1)   ST -8(R1),F8    ADDD F8,F6,F2                       2
  LD F10,-40(R1)  ST 8(R1),F12    ADDD F12,F10,F2                     3

Trace Scheduling

• Parallelism across IF branches vs. LOOP branches
• Two steps:
  – Trace Selection
