
History of Processor Performance

[Figure: Growth in processor performance since the mid-1980s. The chart plots performance relative to the VAX-11/780, as measured by the SPECint benchmarks, against year. Plotted machines range from VAX, Sun, MIPS, IBM, HP PA-RISC, PowerPC, and Alpha systems to the Intel Pentium, AMD Athlon, AMD Opteron, and Intel Xeon. Prior to the mid-1980s, processor performance growth was largely technology driven. The higher growth rate since then is attributable to more advanced architectural and organizational ideas, and led to a difference in performance of about a factor of seven. Performance for floating-point-oriented calculations has increased even faster. More recently, the limits of power, available instruction-level parallelism, and long memory latency have slowed uniprocessor performance growth. Copyright Elsevier, Inc. All rights reserved.]

Tuesday, April 24, 12

[Chart annotations added on later builds of this slide: CSEE 3827, COMS 4824]

Abstract Stages of Execution

Instruction Fetch (instructions fetched from memory into the CPU)

Instruction Issue / Execution (instructions executed on an ALU or other functional unit)

Instruction Completion / Commit (architectural state updated, i.e., register file or memory)

Multiple Instruction Issue Processors

Multiple instructions are fetched, executed, and committed in each cycle.

In superscalar processors, instructions are scheduled by the hardware (HW).

In VLIW processors, instructions are scheduled by the software (SW).

[Diagrams: fetch (F), execute (E), and commit (C) stages for each design]

In all cases, the goal is to exploit instruction-level parallelism (ILP).
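The scheduling idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not how any particular superscalar core or VLIW compiler works: the function name `pack_issue_groups`, the unit-latency assumption, and the toy program are all hypothetical. It greedily places each instruction in the earliest cycle that respects its data dependencies and still has a free issue slot.

```python
from typing import Dict, List, Set


def pack_issue_groups(deps: Dict[str, Set[str]],
                      order: List[str],
                      width: int) -> List[List[str]]:
    """Greedily pack instructions into per-cycle issue groups.

    deps maps an instruction name to the names it depends on.
    Assumptions (illustrative only): every instruction has unit latency,
    and each dependency appears earlier in `order` than its consumer.
    """
    issued_cycle: Dict[str, int] = {}
    groups: List[List[str]] = []
    for instr in order:
        # Earliest cycle allowed by data dependencies.
        earliest = 0
        for dep in deps.get(instr, set()):
            earliest = max(earliest, issued_cycle[dep] + 1)
        # First cycle at or after `earliest` that still has a free slot.
        cycle = earliest
        while cycle < len(groups) and len(groups[cycle]) >= width:
            cycle += 1
        while len(groups) <= cycle:
            groups.append([])
        groups[cycle].append(instr)
        issued_cycle[instr] = cycle
    return groups


# Toy program: c needs a and b, d needs a.
program = ["a", "b", "c", "d"]
dependencies = {"c": {"a", "b"}, "d": {"a"}}

two_wide = pack_issue_groups(dependencies, program, width=2)
one_wide = pack_issue_groups(dependencies, program, width=1)
print(len(two_wide), "cycles at width 2;", len(one_wide), "cycles at width 1")
```

Whether this packing is done at run time by hardware or ahead of time by a compiler is exactly the superscalar versus VLIW distinction; the dependency constraint is the same in both cases.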

Flynn's Taxonomy

Single Instruction, Single Data (SISD): exploits instruction-level parallelism (ILP)

Single Instruction, Multiple Data (SIMD): exploits data-level parallelism (DLP)

Multiple Instruction, Single Data (MISD)

Multiple Instruction, Multiple Data (MIMD)

Out-of-order execution

In in-order execution, instructions are fetched, executed, and committed in compiler order. If one stalls, they all stall. The hardware is relatively simple.

In out-of-order execution (OOO), instructions are fetched and committed in compiler order, but may be executed in some other order. If one stalls, independent instructions may proceed. Additional hardware (HW) is required for reordering.

[Diagrams: fetch (F), execute (E), and commit (C) timelines for each case]

Another way to exploit instruction-level parallelism (ILP)
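The difference between the two policies can be seen in a toy timing model. The sketch below is an assumption-laden illustration, not a model of any real pipeline: it assumes a single-issue machine, made-up latencies, and an unlimited reordering window, and it does not model in-order commit. The function name `finish_time` and the four-instruction program are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Each instruction: (name, latency in cycles, names it depends on).
Instr = Tuple[str, int, List[str]]


def finish_time(instrs: List[Instr], out_of_order: bool) -> int:
    """Return the cycle in which the last instruction finishes executing."""
    finish: Dict[str, int] = {}
    busy_cycles: Set[int] = set()  # cycles in which an instruction already started
    previous_start = -1
    for name, latency, deps in instrs:
        # Operands are ready once every producer has finished.
        ready = max((finish[d] for d in deps), default=0)
        if out_of_order:
            # Start as soon as operands are ready and the issue slot is free.
            start = ready
            while start in busy_cycles:
                start += 1
        else:
            # In order: may not start before the previous instruction started.
            start = max(ready, previous_start + 1)
        busy_cycles.add(start)
        previous_start = start
        finish[name] = start + latency
    return max(finish.values())


program: List[Instr] = [
    ("load", 10, []),       # long-latency load
    ("use", 1, ["load"]),   # stalls until the load returns
    ("mul", 1, []),         # independent of the load
    ("sub", 1, ["mul"]),    # depends only on mul
]

print("in-order:", finish_time(program, out_of_order=False))
print("out-of-order:", finish_time(program, out_of_order=True))
```

In the in-order run, `mul` and `sub` wait behind the stalled `use`; in the out-of-order run they execute while the load is still outstanding.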

Speculation

Speculation is executing an instruction before it is known that it should be executed.

[Diagrams: fetch (F), execute (E), and commit (C) timelines for three cases: speculate, if correct, if incorrect]

Mis-speculation executes excess instructions, costing time and power.
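The trade-off can be made concrete with a small expected-value sketch. The model and all numbers here are assumptions for illustration: not speculating is assumed to cost a fixed stall, a correct speculation is assumed to cost nothing extra, and an incorrect one costs a fixed penalty. The function names are hypothetical.

```python
def expected_cost_of_speculating(p_correct: float,
                                 misspeculation_penalty: float) -> float:
    """Expected extra cycles when speculating.

    A correct guess is assumed to cost nothing extra; an incorrect guess
    costs `misspeculation_penalty` cycles of wasted work and recovery.
    """
    return (1.0 - p_correct) * misspeculation_penalty


def speculation_pays_off(p_correct: float,
                         misspeculation_penalty: float,
                         stall_cycles: float) -> bool:
    """True if speculating is cheaper on average than stalling."""
    return expected_cost_of_speculating(p_correct, misspeculation_penalty) < stall_cycles


# Hypothetical numbers: 3-cycle stall versus a 10-cycle misspeculation penalty.
print(speculation_pays_off(0.9, 10, 3))  # accurate predictor
print(speculation_pays_off(0.5, 10, 3))  # coin-flip predictor
```

This counts only time; the slide's point that mis-speculation also costs power would need a separate energy term.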

The Memory Wall

CPU speeds increase approx. 20-25% annually.

DRAM speeds increase approx. 2-11% annually.

[Chart: performance vs. time for CPU and DRAM]

A result of this gap is that cache design has increased in importance over the years. This has resulted in innovations such as victim caches and trace caches.
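Because both rates compound, the gap widens exponentially. The sketch below illustrates this under the assumption that each rate is constant from year to year; the function name `speed_gap` is hypothetical, and the rates passed in are simply endpoints of the ranges quoted on the slide.

```python
def speed_gap(years: int, cpu_rate: float, dram_rate: float) -> float:
    """Ratio of CPU speedup to DRAM speedup after `years` of compounding.

    Rates are annual fractional improvements (0.25 means 25% per year)
    and are assumed constant, which is a simplification.
    """
    return ((1.0 + cpu_rate) / (1.0 + dram_rate)) ** years


# Widest and narrowest gaps implied by the quoted ranges, over ten years.
widest = speed_gap(10, cpu_rate=0.25, dram_rate=0.02)
narrowest = speed_gap(10, cpu_rate=0.20, dram_rate=0.11)
print(f"gap after 10 years: between {narrowest:.1f}x and {widest:.1f}x")
```

Even the narrowest case leaves the CPU more than twice as far ahead after a decade, which is why the memory hierarchy receives so much design attention.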

Modern Processor Performance

While single-threaded performance has leveled off, multithreaded performance potential is still scaling.

[Chart: the processor performance growth figure shown earlier, annotated with the course labels COMS 6824 + COMS 4130 (Parallel Programming)]

[Chart: the processor performance growth figure, repeated. Source: Hennessy and Patterson, "Computer Architecture: A Quantitative Approach"]

[Overlaid chart: increase in Area, Performance, Power, and MIPS/Watt (%) for Pipelined, Superscalar, OOO-Speculation, and Deep Pipelined designs]

The Power Wall

[Figure 1.15: Clock rate and power for Intel x86 microprocessors over eight generations and 25 years. The Pentium 4 made a dramatic jump in clock rate and power but less so in performance. The Prescott thermal problems led to the abandonment of the Pentium 4 line. The Core 2 line reverts to a simpler pipeline with lower clock rates and multiple processors per chip. Axes: clock rate (MHz) and power (watts). Labeled points include Pentium, Pentium Pro, Pentium 4 Willamette, Pentium 4 Prescott, and Core 2 Kentsfield. Copyright Elsevier, Inc. All rights reserved.]

Much of it goes back to the transistor

individual atoms! = leakage current + defects

[Source: Intel press foils]

A model of power

$P = P_{switch} + P_{leakage}$

$P_{switch} = E_{switch} \times F = (C \times V_{dd}^{2}) \times F$

$P_{leakage} = V_{dd} \times I$
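The power model above translates directly into code. The sketch below evaluates it for made-up component values (the capacitance, voltage, frequency, and leakage current are assumptions chosen only to give round numbers) and shows why lowering voltage and frequency together is so effective: switching power falls with the square of Vdd times F.

```python
def switching_power(c: float, vdd: float, f: float) -> float:
    """P_switch = (C * Vdd^2) * F, in watts for farads, volts, and hertz."""
    return c * vdd ** 2 * f


def leakage_power(vdd: float, i_leak: float) -> float:
    """P_leakage = Vdd * I, in watts for volts and amperes."""
    return vdd * i_leak


def total_power(c: float, vdd: float, f: float, i_leak: float) -> float:
    """P = P_switch + P_leakage."""
    return switching_power(c, vdd, f) + leakage_power(vdd, i_leak)


# Hypothetical operating point: 1 nF switched capacitance, 1.0 V, 2 GHz, 0.5 A leakage.
baseline = total_power(c=1e-9, vdd=1.0, f=2e9, i_leak=0.5)

# Scale voltage and frequency down together by 20%.
scaled_ratio = switching_power(1e-9, 0.8, 1.6e9) / switching_power(1e-9, 1.0, 2e9)

print(f"baseline total power: {baseline:.2f} W")
print(f"switching power after scaling Vdd and F by 0.8: {scaled_ratio:.3f} of baseline")
```

The leakage current is held fixed here for simplicity; in practice it also depends on voltage and temperature, which this sketch does not model.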

Voltage Scaling: DVFS + Near-Threshold Computing

[Source: Dreslinski et al., "Near-Threshold Computing: Reclaiming Moore's Law Through Energy Efficient Integrated Circuits"]

Chip Area and Power Consumption

[Chart: Power Density (Watts/cm2), split into Active Power and Leakage Power, by process node: 90nm, 65nm, 45nm, 32nm, 22nm, 16nm. Source: Shekhar Borkar (Intel)]

Power envelope to remain constant

With leakage power dominating, power consumption roughly proportional to ...

Pollack's Law: Processor performance grows with sqrt of area

[Chart: Integer Performance vs. Processor Area, logarithmic axes. Source: Shekhar Borkar (Intel)]

The Resulting Shift to Multicore

Perf = 1, Power = 1

Perf = 2, Power = 4

Perf = 4, Power = 4
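The three Perf/Power pairs are consistent with Pollack's Law if one reads them as a baseline core, a single core with four times the area, and four baseline cores. That reading is an assumption, as are the two simplifications in the sketch below: power is taken to be proportional to total area, and the workload is taken to be perfectly parallel. The function names are hypothetical.

```python
import math
from typing import Tuple


def core_performance(area: float) -> float:
    """Pollack's Law: single-core performance grows with sqrt of area."""
    return math.sqrt(area)


def chip(n_cores: int, area_per_core: float) -> Tuple[float, float]:
    """Return (performance, power) relative to one baseline core.

    Assumes power is proportional to total area and that the workload
    scales perfectly across cores; both are idealizations.
    """
    performance = n_cores * core_performance(area_per_core)
    power = n_cores * area_per_core
    return performance, power


for label, cores, area in [("baseline core", 1, 1.0),
                           ("one 4x-area core", 1, 4.0),
                           ("four baseline cores", 4, 1.0)]:
    perf, power = chip(cores, area)
    print(f"{label}: Perf = {perf:g}, Power = {power:g}")
```

For the same power budget, four small cores offer twice the peak throughput of one large core, provided the software can actually use them.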

Sea Change in Architecture: Multicore

[Figure: Inside the AMD Barcelona microprocessor. The left-hand side is a microphotograph of the AMD Barcelona processor chip, and the right-hand side shows the major blocks in the processor. This chip has four processors or "cores". Another microprocessor, the Intel Core 2 Duo, has two cores per chip. Labeled blocks: HT PHY links, slow I/O and fuses, FPU, load/store, L1 data cache, L2 cache, shared L3 cache, execution, fetch/decode/branch, L1 instruction cache, Northbridge, DDR PHY, and the cores. Copyright Elsevier, Inc. All rights reserved.]

64-bit Architecture Evolution

                  2003           2005           2007           2008           2009           2010
                  AMD Opteron™   AMD Opteron™   "Barcelona"    "Shanghai"     "Istanbul"     "Magny-Cours"
Mfg.              90nm SOI       90nm SOI       65nm SOI       45nm SOI       45nm SOI       45nm SOI
CPU Core          K8             K8             Greyhound      Greyhound+     Greyhound+     Greyhound+
L2/L3             1MB/0          1MB/0          512kB/2MB      512kB/6MB      512kB/6MB      512kB/12MB
HyperTransport™   3x 1.6GT/s     3x 1.6GT/s     3x 2GT/s       3x 4.0GT/s     3x 4.8GT/s     4x 6.4GT/s
Memory            2x DDR1 300    2x DDR1 400    2x DDR2 667    2x DDR2 800    2x DDR2 1066   4x DDR3 1333

Max Power Budget Remains Consistent

[Source: HotChips '09, The AMD "Magny-Cours" Processor, August 2009]

Tick-Tock Development Model

TOCK   65nm   NEW Microarchitecture
TICK   45nm   NEW Process
TOCK   45nm   NEW Microarchitecture   Nehalem
TICK   32nm   NEW Process             Westmere
TOCK   32nm   NEW Microarchitecture   Sandy Bridge

Forecast

All products, dates, and figures are preliminary and are subject to change without notice.

[Source: HotChips '09, Nehalem-EX Architecture]

Nehalem Core/Uncore Modularity

[Diagram: several cores (Core) above the uncore (Uncore), which contains the L3 cache, the IMC connected to DRAM, multiple QPI links, and power and clock logic]

Common core for client and server CPUs
http://www.intel.com/technology/architecture

[Source: HotChips '09, Nehalem-EX Architecture]

Nehalem-EX CPU

Monolithic single die CPU

8 Nehalem cores, 16 threads

24MB shared L3 cache

2 integrated memory controllers

Scalable Memory Interconnect (SMI) with support for up to 8 DDR channels

4 Quick Path Interconnect (QPI) links with up to 6.4GT/s

Supports 2, 4 and 8 socket in glueless configs and larger systems using Node Controller (NC)

Intel 45nm process technology

2.3 Billion transistors

[Die diagram: eight 3MB LLC slices]

[Source: HotChips '09, Nehalem-EX Architecture]

POWER7: IBM's Next Generation, Balanced POWER Server Chip

20+ Years of POWER Processors

[Timeline chart, 1990 to 2010, with process nodes shrinking from 1.0um to 45nm. Processors shown: POWER1 (AMERICA's), RSC, POWER2, P2SC, POWER3, POWER4, POWER5, POWER6, POWER7, and a next generation part; the 601, 603, and 604e; Muskie A35 and Cobra A10; and the RS64I Apache, RS64II North Star, RS64III Pulsar, and RS64IV Sstar]

Major POWER Innovation
-1990 RISC Architecture
-1994 SMP
-1995 Out of Order Execution
-1996 64 Bit Enterprise Architecture
-1997 Hardware Multi-Threading
-2001 Dual Core Processors
-2001 Large System Scaling
-2001 Shared Caches
-2003 On Chip Memory Control
-2003 SMT
-2006 Ultra High Frequency
-2006 Dual Scope Coherence Mgmt
-2006 Decimal Float/VSX
-2006 Processor Recovery/Sparing
-2009 Balanced Multi-core Processor
-2009 On Chip EDRAM

* Dates represent approximate processor power-on dates, not system availability

[Source: HotChips '09]

POWER7 Processor Chip

567mm2
Technology: 45nm lithography, Cu, SOI, eDRAM
1.2B transistors
Equivalent function of 2.7B
eDRAM efficiency
Eight processor cores
12 execution units per core
4 Way SMT per core
32 Threads per chip
256KB L2 per core
32MB on chip eDRAM shared L3
Dual DDR3 Memory Controllers
100GB/s Memory bandwidth per chip sustained
Scalability up to 32 Sockets
360GB/s SMP bandwidth/chip
20,000 coherent operations in flight
Advanced pre-fetching Data and Instruction
Binary Compatibility with POWER6

* Statements regarding SMP servers do not imply that IBM will introduce a system with this capability.

[Source: HotChips '09]

Challenges in Multicore

1. Performance dependent on parallel codes
2. ("feeding the beast")
3. Communication and coherence

[Source: Sandia NL]

Computer architecture: the long view

The Transistor

John Bardeen, Walter Brattain, William Shockley
Nobel Prize in Physics in 1956

The Integrated Circuit

Jack Kilby at Texas Instruments and Robert Noyce at Fairchild Camera
2000 Nobel Prize in Physics (awarded to Kilby)

Technology: Logic gates, SRAM, DRAM, Circuit technologies, Packaging, Magnetic storage, Flash memory, Biochips, 3D stacking

Computing Devices

Application domains: PCs, Servers, PDAs, Mobile phones, Supercomputers, Game consoles, Embedded

Computing Applications

Goals: Functional, Performance, Reliability, Cost, Energy efficiency, Time to market

Computer architecture is at the intersection

[Credit: Milo Martin, UPenn]