
Cap 1: Introduction (pag 1)

• What is Parallel Architecture?
• Why Parallel Architecture?
• Evolution and Convergence of Parallel Architectures
• Fundamental Design Issues

Introduction

What is Parallel Architecture?

A parallel computer is a collection of processing elements that cooperate to solve large problems fast

Some broad issues:
• Resource Allocation:
 – how large a collection?
 – how powerful are the elements?
 – how much memory?
• Data access, Communication and Synchronization
 – how do the elements cooperate and communicate?
 – how are data transmitted between processors?
 – what are the abstractions and primitives for cooperation?
• Performance and Scalability
 – how does it all translate into performance?
 – how does it scale? (does growth saturate or not?)

1.1 Why Study Parallel Architecture? (pag 4)

Role of a computer architect:
• To design and engineer the various levels of a computer system to maximize performance and programmability within limits of technology and cost

Parallelism:
• Provides alternative to faster clock for performance (technology limitations)
• Applies at all levels of system design (book's emphasis: pipeline, communication, synchronization)
• Is a fascinating perspective from which to view architecture
• Is increasingly central in information processing

Why Study it Today?

• History: diverse and innovative organizational structures, often tied to novel programming models
• Rapidly maturing under strong technological constraints
 – The "killer micro" is ubiquitous
 – Laptops and supercomputers are fundamentally similar!
 – Technological trends cause diverse approaches to converge in the mainstream
• Technological trends make parallel computing inevitable
• Need to understand fundamental principles and design tradeoffs, not just taxonomies
 – Naming, Ordering, Replication (of data), Communication performance

Inevitability of Parallel Computing

• Application demands: our insatiable need for computing cycles
 – Scientific computing: CFD (Computational Fluid Dynamics), Biology, Chemistry, Physics, ...
 – General-purpose computing: Video, Graphics, CAD, Databases, TP...
• Technology Trends
 – Number of transistors on chip growing rapidly
 – Clock rates expected to go up only slowly
• Architecture Trends
 – Instruction-level parallelism valuable but limited
 – Coarser-level parallelism, as in MPs, the most viable approach
• Economics
• Current trends:
 – Today's microprocessors have multiprocessor support
 – Servers and workstations becoming MP: Sun, SGI, DEC, COMPAQ!...
 – Tomorrow's microprocessors are multiprocessors (today: multicore)

1.1.1 Application Trends (pag 6)

• Demand for cycles fuels advances in hardware, and vice-versa
 – Cycle drives exponential increase in microprocessor performance
 – Drives parallel architecture harder: most demanding applications
• Range of performance demands
 – Need range of system performance with progressively increasing cost
 – Platform pyramid
• Goal of applications in using parallel machines: Speedup

 Speedup (p processors) = Performance (p processors) / Performance (1 processor)

• For a fixed problem size (input data set), performance = 1/time

 Speedup fixed problem (p processors) = Time (1 processor) / Time (p processors)
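A minimal sketch of the definition above, with made-up timings (the numbers are illustrative only, not from the book):

```c
#include <stdio.h>

/* Speedup for a fixed problem size: S(p) = T(1) / T(p).
   The timings below are assumed values used only to exercise the formula. */
int main(void) {
    double t1 = 120.0;        /* execution time on 1 processor (s), assumed */
    double tp = 8.5;          /* execution time on p processors (s), assumed */
    int    p  = 16;

    double speedup    = t1 / tp;      /* how much faster the parallel run is */
    double efficiency = speedup / p;  /* fraction of ideal linear speedup */

    printf("speedup = %.2f, efficiency = %.2f\n", speedup, efficiency);
    return 0;
}
```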

Scientific Computing Demand (pag 7)

• Fig 1.2, pag 7: demand data from 1993

Engineering Computing Demand (pag 8)

• Large parallel machines a mainstay in many industries
 – Petroleum (reservoir analysis)
 – Automotive (crash simulation, drag analysis, combustion efficiency)
 – Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism)
 – Computer-aided design
 – Pharmaceuticals (molecular modeling)
 – Visualization
  - in all of the above
  - entertainment (films like Toy Story, hundreds of SUNs)
  - architecture (walk-throughs and rendering)
 – Financial modeling (yield and derivative analysis)
 – etc.

Applications: Speech and Image Processing (pag 9)

• Figure: processing demands grow from about 1 MIPS (sub-band speech coding, 1980) through isolated speech recognition, CELP speech coding, speaker verification, ISDN-CD stereo and CIF video receivers, up to about 1 GIPS (5,000-word continuous speech recognition, HDTV receiver, 1995)
• 100 processors gets you 10 years, 1000 gets 20! (today even more)
• Also CAD, Databases, ...

Learning Curve for Parallel Applications (pag 9)

• AMBER (Assisted Model Building through Energy Refinement): molecular dynamics simulation program
• Starting point was vector code for Cray-1 (equivalent to version 8; versions 9 and 12 incorporated code optimizations aimed at parallelization)
• 145 MFLOP on Cray90; 406 MFLOP for final version on 128-processor Paragon; 891 MFLOP on 128-processor Cray T3D

Commercial Computing (pag 10)

• Also relies on parallelism for high end
 – Scale not so large, but use much more wide-spread
 – Computational power determines scale of business that can be handled
• Databases, online-transaction processing, decision support, data mining, data warehousing ...
• TPC (Transaction Processing Performance Council) benchmarks (TPC-C order entry, TPC-D decision support)
 – Explicit scaling criteria provided
 – Size of enterprise scales with size of system
 – Problem size no longer fixed as p increases, so throughput is used as a performance measure (transactions per minute or tpm)

TPC-C Results for March 1996 (pag 10)

• Figure: throughput (tpmC) versus number of processors (0 to 120) for Tandem Himalaya, DEC Alpha, SGI PowerChallenge, HP PA, IBM PowerPC and others
• Parallelism is pervasive
• Small to moderate scale parallelism very important
• Difficult to obtain snapshot to compare across vendor platforms (systems shown were launched at different dates)
 – Ex: Tandem is older than DEC Alpha and PowerPC

Summary of Application Trends (pag 12)

• Transition to parallel computing has occurred for scientific and engineering computing
• In rapid progress in commercial computing
 – Database and financial transactions as well
 – Usually smaller-scale, but large-scale systems also used
• Desktop also uses multithreaded programs, which are a lot like parallel programs
• Demand for improving throughput on sequential workloads
 – Greatest use of small-scale multiprocessors
• Solid application demand exists and will increase

1.1.2 Technology Trends (pag 15 and 4)

• Figure: performance growth 1965-1995 of mainframes, minicomputers and microprocessors (microprocessors: 32 bits and cache)
• The natural building block for multiprocessors is now also about the fastest!

General Technology Trends (pag 13)

• Microprocessor performance increases 50% - 100% per year
 – Figure: Linpack and SPEC integer/FP performance 1987-1992 (Sun 4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, DEC alpha)
• Transistor count doubles every 3 years
• DRAM size quadruples every 3 years
 – DRAM: size grew 1000x but access time only 2x from '80 to '95
• Huge investment per generation is carried by huge commodity market
• Not that single-processor performance is plateauing, but that parallelism is a natural way to improve it.

Technology: A Closer Look (pag 12)

• Basic advance is decreasing feature size
 – Circuits become either faster or lower in power
• Die size is growing too
 – Clock rate improves roughly proportional to improvement in feature size
 – Number of transistors improves like the square (or faster)
• Performance > 100x per decade; clock rate 10x, rest transistor count
• How to use more transistors?
 – Parallelism in processing
  - multiple operations per cycle reduces CPI
 – Locality in data access
  - avoids latency and reduces CPI
  - also improves processor utilization
 – Both need resources, so tradeoff
• Fundamental issue is resource distribution, as in uniprocessors
 – Diagram: Proc – $ – Interconnect

Clock Frequency Growth Rate (pag 13)

• 30% per year
• Figure: clock rate (MHz) of microprocessors 1970-2005, from i4004, i8008, i8080, i8086, i80286, i80386 up to Pentium100

Transistor Count Growth Rate (pag 13)

• Transistor count grows much faster than clock rate
 – 40% per year, order of magnitude more contribution in 2 decades
• 2012 examples: Nvidia GK110 (Kepler-based GPU), 7.1 billion transistors; 9500-series processor, 3.1 billion
• Figure: transistor counts 1970-2005 (i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, R10000, Pentium)

Similar Story for Storage (pag 14)

• Divergence between memory capacity and speed more pronounced
 – Capacity increased by 1000x from 1980-95, speed only 2x
 – Gigabit DRAM by c. 2000, but gap with processor speed much greater
• Larger memories are slower, while processors get faster
 – Need to transfer more data in parallel
 – Need deeper cache hierarchies
 – How to organize caches?
• Parallelism increases effective size of each level of hierarchy, without increasing access time
• Parallelism and locality within memory systems too
 – New designs fetch many bits within memory chip; follow with fast pipelined transfer across narrower interface
 – Buffer caches most recently accessed data
• Disks too: parallel disks plus caching

1.1.3 Architectural Trends (pag 14)

• Architecture translates technology's gifts to performance and capability
• Resolves the tradeoff between parallelism and locality
 – Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect
 – Tradeoffs may change with scale and technology advances
• Understanding microprocessor architectural trends
 – Helps build intuition about design issues for parallel machines
 – Shows fundamental role of parallelism even in "sequential" computers
• Four generations of architectural history: tube, transistor, IC, VLSI
 – Here focus only on VLSI generation
• Greatest delineation in VLSI has been in type of parallelism exploited

Architectural Trends (pag 15)

• Greatest trend in VLSI generation is increase in parallelism
 – Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
  - slows after 32-bit
  - adoption of 64-bit now under way, 128-bit far (not a performance issue)
  - great inflection point when 32-bit micro and cache fit on a chip (see fig 1.1)
 – Mid 80s to mid 90s: instruction-level parallelism
  - pipelining and simple instruction sets, + compiler advances (RISC)
  - on-chip caches and functional units => superscalar execution
  - greater sophistication: out-of-order execution, speculation, prediction
   - to deal with control transfer and latency problems
 – Next step: thread-level parallelism

Phases in VLSI Generation (pag 16)

• Figure: transistor counts 1970-2005 showing the phases of bit-level parallelism (i4004 to i80286), instruction-level parallelism (i80386, R2000, R3000, Pentium, R10000), and thread-level parallelism (?)
• How good is instruction-level parallelism?
• Thread-level needed in microprocessors?

Architectural Trends: ILP (pag 19)

• Reported speedups for superscalar processors:
 – Butler et al. [1991] ........................... 17+
 – Melvin and Patt [1991] ......................... 8
 – Wall [1991] .................................... 5
 – Lee, Kwok, and Briggs [1991] ................... 3.50
 – Jouppi and Wall [1989] ......................... 3.20
 – Chang et al. [1991] ............................ 2.90
 – Murakami et al. [1989] ......................... 2.55
 – Smith, Johnson, and Horowitz [1989] ............ 2.30
 – Wang and Wu [1988] ............................. 1.70
 – Horst, Harris, and Jardine [1990] .............. 1.37
• Large variance due to differences in
 – application domain investigated (numerical versus non-numerical)
 – capabilities of processor modeled

ILP Ideal Potential (pag 18)

• Infinite resources and fetch bandwidth, perfect branch prediction and renaming
 – Unlimited resources; the only restriction is data dependence
 – real caches and non-zero miss latencies
• Figure: fraction of total cycles (%) versus number of instructions issued (0 to 6+), and speedup versus instructions issued (0 to 15)

Results of ILP Studies

• Concentrate on parallelism for 4-issue machines
• Realistic studies show only 2-fold speedup
• Recent studies show that more ILP needs to look across threads

Architectural Trends: Bus-based MPs (pag 19)

• Micro on a chip makes it natural to connect many to shared memory
 – dominates server and enterprise market, moving down to desktop
• Faster processors began to saturate bus, then bus technology advanced
 – today, range of sizes for bus-based systems, desktop to large servers
• Figure: number of processors in fully configured commercial shared-memory systems, 1984-1998 (Sequent B8000/B2100, SGI PowerSeries, Symmetry81/21, Sun SC2000, SGI Challenge, SGI PowerChallenge/XL, Cray CS6400, AS8400, HP K400, Sun E6000, Sun E10000, Pentium Pro, ...)

Bus Bandwidth (pag 20)

• Figure: shared bus bandwidth (MB/s) of commercial bus-based machines, 1984-1998 (Sequent B8000 up to Sun E10000)
• Many processors already come with multiprocessor support (Pentium Pro: 4 processors attach to a single bus with no glue logic)
• Small-scale multiprocessing has become a commodity

Economics (pag 20)

• Commodity microprocessors not only fast but CHEAP
 – Development cost is tens of millions of dollars (5-100 typical)
 – BUT, many more are sold compared to supercomputers
 – Crucial to take advantage of the investment, and use the commodity building block
 – Exotic parallel architectures no more than special-purpose
• Multiprocessors being pushed by software vendors (e.g. database) as well as hardware vendors
• Standardization by Intel makes small, bus-based SMPs commodity
• Desktop: few smaller processors versus one larger one?
 – Multiprocessor on a chip

1.1.4 Scientific Supercomputing (pag 21)

• Proving ground and driver for innovative architecture and techniques
 – Market smaller relative to commercial as MPs become mainstream
 – Dominated by vector machines starting in 70s
 – Microprocessors have made huge gains in floating-point performance
  - high clock rates
  - pipelined floating point units (e.g., multiply-add every cycle)
  - instruction-level parallelism
  - effective use of caches (e.g., automatic blocking)
 – Plus economics
• Large-scale multiprocessors replace vector supercomputers
 – Well under way already

Raw Uniprocessor Performance: LINPACK (pag 22)

• Figure: LINPACK MFLOPS, 1975-1995, two points per system: n = 100 and n = 1,000 matrix
 – CRAY vector machines: CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94
 – Microprocessors: Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, HP 9000/750, DEC Alpha AXP, HP 9000/735, MIPS R4400, IBM Power2/990, DEC 8200

Raw Parallel Performance: LINPACK (pag 24)

• Even vector Crays became parallel: X-MP (2-4), Y-MP (8), C-90 (16), T94 (32)
• Since 1993, Cray produces MPPs (Massively Parallel Processors) too (T3D, T3E)
• Figure: LINPACK GFLOPS, 1985-1996, MPP peak versus CRAY peak
 – Crays: Xmp/416 (4), Ymp/832 (8), C90 (16), T932 (32)
 – MPPs: iPSC/860, nCUBE/2 (1024), CM-200, Delta, CM-5, Paragon XP/S, Paragon XP/S MP (1024), T3D, ASCI Red (6768)

500 Fastest Computers (pag 24)

• PVP: Parallel Vector Processors; MPP: Massively Parallel Processors; SMP: Symmetric Shared-Memory Multiprocessors
• Figure: number of systems of each type in the list, 11/93 to 11/96
 – MPPs grow from 187 (11/93) to 239, 284 and 319 (11/96)
 – PVPs decline from 313 (11/93) to 198, ~110 and 73 (11/96)
 – SMPs grow from 63 (11/94) to about 106
• See http://www.top500.org/

Summary: Why Parallel Architecture?

• Increasingly attractive
 – Economics, technology, architecture, application demand
• Increasingly central and mainstream
• Parallelism exploited at many levels
 – Instruction-level parallelism
 – Multiprocessor servers
 – Large-scale multiprocessors ("MPPs")
• Focus of this class: multiprocessor level of parallelism
• Same story from memory system perspective
 – Increase bandwidth, reduce average latency with many local memories
• Wide range of parallel architectures makes sense
 – Different cost, performance and scalability

1.2 Convergence of Parallel Architectures

History (pag 25)

• Historically, parallel architectures tied to programming models
 – Divergent architectures, with no predictable pattern of growth
 – Figure: application software and system software built on top of distinct architectures: Systolic Arrays, SIMD, Message Passing, Dataflow, Shared Memory
• Uncertainty of direction paralyzed parallel software development!

1.2.1 Today (pag 25)

• Extension of "computer architecture" to support communication and cooperation
 – OLD: Instruction Set Architecture
 – NEW: Communication Architecture
• Defines
 – Critical abstractions, boundaries (HW/SW and user/system), and primitives (interfaces)
 – Organizational structures that implement interfaces (hw or sw)
• Compilers, libraries and OS are important bridges today

Modern Layered Framework (pag 26)

• Figure (layers, top to bottom): parallel applications (CAD, Database, Scientific modeling, Multiprogramming) -> programming models (shared address, message passing, data parallel) -> compilation or library -> operating systems support -> communication hardware -> physical communication medium
 – Communication abstraction marks the user/system boundary; the hardware/software boundary lies below the OS support layer
• The distance between one level and the next indicates whether the mapping is simple or not
 – ex: access to a variable
  - SAS: simply ld or st
  - Message passing: involves a library or system call

Programming Model (pag 26)

• What programmer uses in coding applications
• Specifies communication and synchronization
• Examples:
 – Multiprogramming: no communication or synch. at program level
 – Shared address space: like bulletin board
 – Message passing: like letters or phone calls, explicit point to point
 – Data parallel: more regimented, global actions on data
  - Implemented with shared address space or message passing

Communication Abstraction (pag 27)

• User level communication primitives provided
 – Realizes the programming model
 – Mapping exists between language primitives of programming model and these primitives
• Supported directly by hw, or via OS, or via user sw
• Lot of debate about what to support in sw and gap between layers
• Today:
 – Hw/sw interface tends to be flat, i.e. complexity roughly uniform
 – Compilers and software play important roles as bridges
 – Technology trends exert strong influence
• Result is convergence in organizational structure
 – Relatively simple, general purpose communication primitives

Communication Architecture = User/System Interface + Implementation

• User/System Interface:
 – Comm. primitives exposed to user-level by hw and system-level sw
• Implementation:
 – Organizational structures that implement the primitives: hw or OS
 – How optimized are they? How integrated into processing node?
 – Structure of network
• Goals:
 – Performance
 – Broad applicability
 – Programmability
 – Scalability
 – Low Cost

Evolution of Architectural Models (pag 28)

• Historically machines tailored to programming models
 – Prog. model, comm. abstraction, and machine organization lumped together as the "architecture"
• Evolution helps understand convergence
 – Identify core concepts
• Examine programming model, motivation, intended applications, and contributions to convergence:
 – Shared Address Space
 – Message Passing
 – Data Parallel
• Others:
 – Dataflow
 – Systolic Arrays

1.2.2 Shared Address Space Architectures (pag 28)

• Any processor can directly reference any memory location
 – Communication occurs implicitly as result of loads and stores
• Convenient:
 – Location transparency
 – Similar programming model to time-sharing on uniprocessors
  - Except processes run on different processors
  - Good throughput on multiprogrammed workloads
• Naturally provided on wide range of platforms
 – History dates at least to precursors of mainframes in early 60s
 – Wide range of scale: few to hundreds of processors
• Popularly known as shared memory machines or model
 – Ambiguous: memory may be physically distributed among processors

Shared Address Space Model (pag 29)

• Process: virtual address space plus one or more threads of control
• Portions of address spaces of processes are shared
 – Figure: virtual address spaces for a collection of processes communicating via shared addresses; each process has a private portion and a shared portion of its address space, mapped (load/store) onto a common machine physical address space
• Writes to shared address are visible to other threads (in other processes too)
• Natural extension of uniprocessor model: conventional memory operations for comm.; special atomic operations for synchronization
• OS uses shared memory to coordinate processes
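A minimal sketch of the model in POSIX threads (an API chosen here just for illustration; the slides do not prescribe one): ordinary loads and stores communicate through a shared variable, and a lock provides the special synchronization operation.

```c
#include <pthread.h>
#include <stdio.h>

/* Shared portion of the address space: a plain variable plus a lock. */
static long counter = 0;                          /* communicated via loads/stores */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);                /* special op for synchronization */
        counter++;                                /* ordinary store to shared data */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);           /* 400000: all threads see the same memory */
    return 0;
}
```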

Communication Hardware (pag 29)

• Also natural extension of uniprocessor (structure simply enlarged)
• Already have processor, one or more memory modules and I/O controllers connected by hardware interconnect of some sort
 – Figure: processors, memories and I/O controllers attached to an interconnect
• Memory capacity increased by adding modules, I/O by controllers
 – Add processors for processing!
 – For higher-throughput multiprogramming, or parallel programs

History (pag 29)

• "Mainframe" approach (see fig. 1.16)
 – Motivated by multiprogramming
 – Extends crossbar used for mem bw and I/O
 – Originally processor cost limited to small scale
  - later, cost of crossbar
 – Bandwidth scales with p
 – High incremental cost; use multistage instead
• "Minicomputer" approach
 – Almost all microprocessor systems have bus
 – Motivated by multiprogramming, TP
 – Used heavily for parallel computing
 – Called symmetric multiprocessor (SMP)
 – Latency larger than for uniprocessor
 – Bus is bandwidth bottleneck
  - caching is key: coherence problem
 – Low incremental cost

Example: Intel Pentium Pro Quad (pag 33)

• All coherence and multiprocessing glue in processor module
• Highly integrated, targeted at high volume
• Low latency and bandwidth
• Figure: four P-Pro modules (CPU + 256-KB L2 $ + bus interface) and an interrupt controller on the P-Pro bus (64-bit data, 36-bit address, 66 MHz); memory controller + MIU to 1-, 2-, or 4-way interleaved DRAM; PCI bridges to PCI I/O cards

Example: Sun Enterprise (pag 35)

• 16 cards of either type: processors + memory, or I/O
 – CPU/mem card: 2 UltraSparc processors with $, memory controller, 2-way interleaved memory
 – I/O card: bus interface, SBUS slots, 100bT, SCSI, FiberChannel
• All memory accessed over bus, so symmetric
• Higher bandwidth, higher latency bus
 – Gigaplane bus: 256-bit data, 41-bit address, 83 MHz

Scaling Up

• Problem is interconnect: cost (crossbar) or bandwidth (bus)
• "Dance hall": network between processors and memory modules
 – bandwidth still scalable, but lower cost than crossbar
 – latencies to memory uniform, but uniformly large
• Distributed memory or non-uniform memory access (NUMA)
 – Construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
• Caching shared (particularly nonlocal) data?

Example (NUMA): Cray T3E (pag 37)

• Scale up to 1024 processors (Alpha, 6 neighbors per node), 480 MB/s links
• Memory controller generates comm. request for nonlocal references
• No hardware mechanism for coherence (SGI Origin etc. provide this)
• Figure: node with processor P + $, memory and memory controller/NI, attached to a 3D (X, Y, Z) torus switch and external I/O

1.2.3 Message Passing Architectures (pag 37)

• Complete computer as building block, including I/O
 – Communication via explicit I/O operations (not via memory operations)
• Programming model: directly access only private address space (local memory); comm. via explicit messages (send/receive)
• High-level block diagram similar to distributed-memory SAS
 – But comm. integrated at I/O level, needn't be into memory system
 – Like networks of workstations (clusters), but much tighter integration (no monitor/keyboard per node)
 – Easier to build than scalable SAS
• Programming model more removed (more distant) from basic hardware operations
 – Library or OS intervention

Message-Passing Abstraction (pag 38)

• Figure: process P executes send(X, Q, t) on a local address X; process Q executes receive(Y, P, t) into a local address Y; the match on (process, tag) transfers the data between the two private address spaces
• Send specifies buffer to be transmitted and receiving process
• Recv specifies sending process and application storage to receive into
• Memory to memory copy, but need to name processes
• Optional tag on send and matching rule on receive
• User process names local data and entities in process/tag space too
• In simplest form, the send/recv match achieves pairwise synch event
 – Other variants too
• Many overheads: copying, buffer management, protection
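A minimal sketch of the abstraction using MPI (the slides describe a generic send/receive; MPI is just one concrete realization chosen for illustration):

```c
#include <mpi.h>
#include <stdio.h>

/* Two processes: rank 0 sends a buffer X, rank 1 receives it into Y.
   The tag (42 here) must match between send and receive. */
int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double X[4] = {1.0, 2.0, 3.0, 4.0};               /* local address space of P */
        MPI_Send(X, 4, MPI_DOUBLE, 1, 42, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double Y[4];                                      /* local address space of Q */
        MPI_Recv(Y, 4, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %.1f %.1f %.1f %.1f\n", Y[0], Y[1], Y[2], Y[3]);
    }

    MPI_Finalize();
    return 0;
}
```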

Evolution of Message-Passing Machines (pag 39)

• Early machines (up to ~'85): FIFO on each link
 – Hw close to prog. model; synchronous ops
 – Replaced by DMA, enabling non-blocking ops
  - Buffered by system at destination until recv
• Typical topologies: hypercube, mesh
 – At first, topology was important (could only name neighbor processors)
• Diminishing role of topology
 – Store & forward routing: topology important
 – Introduction of pipelined routing made it less so
 – Cost is in node-network interface
 – Simplifies programming

Example: IBM SP-2 (pag 41)

• Made out of essentially complete RS6000 workstations
• Network interface integrated in I/O bus (bw limited by I/O bus)
• Figure: node with Power 2 CPU + L2 $, memory bus, 4-way interleaved DRAM, and a NIC (i860, DMA, NI) on the MicroChannel bus; general interconnection network formed from 8-port switches

Example: Intel Paragon (pag 41)

• Sandia's Intel Paragon XP/S-based supercomputer
• Figure: node with two i860 processors + L1 $, memory controller, 4-way interleaved DRAM, DMA, driver and NI (memory bus 64-bit, 50 MHz); 2D grid network with a processing node attached to every switch; links are 8 bits, 175 MHz, bidirectional

1.2.4 Toward Architectural Convergence (pag 42)

• Evolution and role of software have blurred the boundary (SAS x MP)
 – Send/recv supported on SAS machines via buffers
 – Can construct global address space on MP using hashing
 – Page-based (or finer-grained) shared virtual memory
• Hardware organization converging too
 – Tighter NI integration even for MP (low-latency, high-bandwidth)
 – At lower level, even hardware SAS passes hardware messages
• Even clusters of workstations/SMPs are parallel systems (the network is the computer)
 – Emergence of fast system area networks (SAN)
• Programming models distinct, but organizations converging
 – Nodes connected by general network and communication assists
 – Implementations also converging, at least in high-end machines

1.2.5 Data Parallel Systems (pag 44)

• Other names: SIMD or processor array
• Programming model
 – Operations performed in parallel on each element of a data structure (array or vector)
 – Logically single thread of control, performs sequential or parallel steps
 – Conceptually, a processor associated with each data element
• Architectural model
 – Array of many simple, cheap processors with little memory each
  - Processors don't sequence through instructions
 – Attached to a control processor that issues instructions
 – Specialized and general communication, cheap global synchronization
 – Figure: control processor broadcasting to a grid of PEs
• Original motivations
 – Matches simple differential equation solvers
 – Centralize high cost of instruction fetch/sequencing (which was large)

Application of Data Parallelism

• Each PE contains an employee record with his/her salary
• If salary > 100K then
   salary = salary * 1.05
  else
   salary = salary * 1.10
 – Logically, the whole operation is a single step
 – Some processors enabled for arithmetic operation, others disabled
• Other examples:
 – Finite differences, linear algebra, ...
 – Document searching, graphics, image processing, ...
• Some recent machines:
 – Thinking Machines CM-1, CM-2 (and CM-5) (see fig 1.25)
 – Maspar MP-1 and MP-2, ...
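A sketch of the same computation written as a data-parallel loop in plain C (the loop over elements stands in for the per-PE enable/disable; the array contents are made up):

```c
#include <stdio.h>

#define N 8

/* Data-parallel view: one "virtual PE" per array element.
   The comparison produces a per-element mask; each element applies the raise
   selected by its mask, in what is logically a single parallel step. */
int main(void) {
    double salary[N] = {50e3, 120e3, 90e3, 200e3, 75e3, 101e3, 30e3, 99e3};

    for (int i = 0; i < N; i++) {            /* conceptually, all i at once */
        int high = salary[i] > 100e3;        /* mask: PE enabled for one branch */
        salary[i] *= high ? 1.05 : 1.10;     /* disabled PEs take the other factor */
    }

    for (int i = 0; i < N; i++) printf("%.0f\n", salary[i]);
    return 0;
}
```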

Evolution and Convergence (pag 47)

• Rigid control structure (SIMD in Flynn taxonomy)
 – SISD = uniprocessor, MIMD = multiprocessor
• Popular when cost savings of centralized sequencer high
 – 60s, when a CPU was a cabinet
 – Replaced by vectors in mid-70s (great simplification)
 – Revived in mid-80s when 32-bit datapath slices just fit on a chip (32 1-bit processors on a single chip)
 – No longer true with modern microprocessors
• Other reasons for demise
 – Simple, regular applications have good locality, can do well anyway (a generic cache works just as well)
 – Loss of applicability due to hardwiring data parallelism
  - MIMD machines as effective for data parallelism and more general
• Prog. model converges with SPMD (single program multiple data)
 – Contributes need for fast global synchronization
 – Structured global address space, implemented with either SAS or MP

1.2.6 (1) Dataflow Architectures (pag 47)

• Represent computation as a graph of essential dependences
 – Example dataflow graph: a = (b+1) * (b-c); d = c * e; f = a * d
• Logical processor at each node, activated by availability of operands
• Message (tokens) carrying tag of next instruction sent to next processor
• Tag compared with others in matching store; a match fires execution (message token = tag (address) + data)
• Figure: per-node pipeline of token queue -> waiting/matching token store -> instruction fetch from program store -> execute -> form token, connected through the network
 – Fetch the instruction (token) saying what to do; on a match, execute and pass the result onward on the network

Evolution and Convergence (pag 48)

• Key characteristics
 – Ability to name operations, synchronization, dynamic scheduling
 – Static dataflow: each node represents a primitive operation
 – Dynamic dataflow: a complex function executed by the node
• Converged to use conventional processors and memory
 – Support for large, dynamic set of threads to map to processors
 – Typically shared address space as well
 – But separation of progr. model from hardware (like data-parallel)
• Lasting contributions:
 – Integration of communication with thread (handler) generation
 – Tightly integrated communication and fine-grained synchronization
 – Remained useful concept for software (compilers etc.)

1.2.6 (2) Systolic Architectures (pag 49)

• Replace single processor with regular array of processing elements
• Orchestrate data flow for high throughput with less memory access
 – Figure: memory feeding a single PE versus memory feeding a chain of PEs
• Different from pipelining
 – Nonlinear array structure, multidirection data flow, each PE may have (small) local instruction and data memory
• Different from SIMD: each PE may do something different
• Initial motivation: VLSI enables inexpensive special-purpose chips
• Represent algorithms directly by chips connected in regular pattern

Systolic Arrays (contd.) (pag 50)

• Example: systolic array for 1-D convolution

 y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3)

 – Figure: x values stream through a chain of cells holding weights w4, w3, w2, w1; each cell computes y_out = y_in + w * x_in and passes x along (x_out = x_in)
• Practical realizations (e.g. iWARP)
 – use quite general processors
  - Enable variety of algorithms on same hardware
 – But dedicated interconnect channels
  - Data transfer directly from register to register across channel
• Specialized, and same problems as SIMD
 – General purpose systems work well for same algorithms (locality etc.)
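A plain-C rendering of the convolution the array computes (a sequential reference, not the systolic schedule itself; weights and input values are illustrative):

```c
#include <stdio.h>

#define K 4        /* number of weights, as in the slide's example */
#define N 8        /* number of outputs computed */

/* y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3)
   Each systolic cell would hold one weight and accumulate y as x streams by;
   here the inner loop plays the role of the chain of cells. */
int main(void) {
    double w[K] = {0.5, 0.25, 0.15, 0.10};
    double x[N + K - 1] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
    double y[N];

    for (int i = 0; i < N; i++) {
        y[i] = 0.0;
        for (int k = 0; k < K; k++)
            y[i] += w[k] * x[i + k];
        printf("y(%d) = %.2f\n", i, y[i]);
    }
    return 0;
}
```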

1.2.7 Convergence: Generic Parallel Architecture (pag 51)

• A generic modern multiprocessor
 – Figure: nodes (processor(s) + $ + memory + communication assist (CA)) connected by a scalable network
• Node: processor(s), memory system, plus communication assist
 – Network interface and communication controller
• Scalable network
• Convergence allows lots of innovation, now within framework
 – Integration of assist with node, what operations, how efficiently...
• Programming model -> effect on the Communication Assist
 – See the effect for SAS, MP, Data Parallel and Systolic Array

1.3 Fundamental Design Issues (pag 52)

Understanding Parallel Architecture

• Traditional taxonomies (SIMD/MIMD) not very useful (because multiple general-purpose processors are dominant)
• Focusing on programming models not enough, nor on hardware structures
 – The same one can be supported by radically different architectures
• (Decisions) Focus should be on:
 – Architectural distinctions that affect software
  - Compilers, libraries, programs
 – Design of user/system and hardware/software interface
  - Constrained from above by progr. models and below by technology
 – Communication Abstraction: the interface between the programming model and the system implementation; its importance is equivalent to that of the instruction set in conventional computers
• Guiding principles provided by layers
 – What primitives are provided at communication abstraction
 – How programming models map to these
 – How they are mapped to hardware

Fundamental Design Issues (pag 53)

• At any layer: interface (contract between HW and SW) and performance aspects (must allow each side to improve individually)
• Naming: How are logically shared data and/or processes referenced?
• Operations: What operations are provided on these named data?
• Ordering: How are accesses to data ordered and coordinated?
• Replication: How are data replicated to reduce communication?
• Communication Cost: Latency, bandwidth, overhead, occupancy
• Other issues
 – Node Granularity: How to split between processors and memory?
 – ...
• Understand at programming model first, since that sets requirements

Sequential Programming Model (pag 53)

• Contract
 – Naming: can name any variable in virtual address space (example: uniprocessors)
  - Hardware (and perhaps compilers) does translation to physical addresses
 – Operations: Loads and Stores
 – Ordering: Sequential program order
• Performance
 – Rely on dependences on single location (mostly): dependence order
 – Compilers and hardware violate other orders without getting caught
  - Compiler: reordering and register allocation
  - Hardware: out of order, pipeline bypassing, write buffers
 – Transparent replication in caches

SAS Programming Model (pag 54)

• Naming: Any process can name any variable in shared space
• Operations: loads and stores, plus those needed for ordering
• Simplest Ordering Model:
 – Within a process/thread: sequential program order
 – Across threads: some interleaving (as in time-sharing)
 – Additional orders through synchronization
 – Again, compilers/hardware can violate orders without getting caught
  - Different, more subtle ordering models also possible (discussed later)

Synchronization (pag 57)

• Mutual exclusion (locks)
 – Ensure certain operations on certain data can be performed by only one process at a time
 – Room that only one person can enter at a time
 – No ordering guarantees (order doesn't matter; what matters is that only one has access at a time)
• Event synchronization
 – Ordering of events to preserve dependences
  - e.g. producer -> consumer of data
  - passing of the baton
 – 3 main types:
  - point-to-point
  - global
  - group
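A small sketch of point-to-point event synchronization (producer -> consumer) using a POSIX condition variable; the variable and flag names are illustrative, not from the slides:

```c
#include <pthread.h>
#include <stdio.h>

/* The producer computes a value; the consumer must not read it until it is
   ready. The condition variable orders the two events (produce before consume). */
static int data;
static int ready = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg) {
    pthread_mutex_lock(&m);
    data = 42;                 /* produce */
    ready = 1;
    pthread_cond_signal(&c);   /* event: data is available */
    pthread_mutex_unlock(&m);
    return NULL;
}

static void *consumer(void *arg) {
    pthread_mutex_lock(&m);
    while (!ready)             /* wait for the event */
        pthread_cond_wait(&c, &m);
    printf("consumed %d\n", data);
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t p, q;
    pthread_create(&q, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(q, NULL);
    return 0;
}
```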

Message Passing Programming Model (pag 55)

• Naming: Processes can name private data directly
 – No shared address space
• Operations: Explicit communication through send and receive
 – Send transfers data from private address space to another process
 – Receive copies data from a process into private address space
 – Must be able to name processes
• Can construct a global address space:
 – Process number + address within process address space
 – But no direct operations on these names
• Ordering:
 – Program order within a process
 – Send and receive can provide pt-to-pt synch between processes
 – Mutual exclusion inherent

Design Issues Apply at All Layers

• Prog. model's position provides constraints/goals for system
• In fact, each interface between layers supports or takes a position on:
 – Naming model
 – Set of operations on names
 – Ordering model
 – Replication
 – Communication performance
• Any set of positions can be mapped to any other by software
• Let's see issues across layers:
 – How lower layers can support contracts of programming models
 – Performance issues

Naming and Operations (pag 55)

• Naming and operations in programming model can be directly supported by lower levels (uniform at all abstraction levels), or translated by compiler, libraries or OS
• Example: shared virtual address space in programming model
 – Alt1: Hardware interface supports shared (global) physical address space
  - Direct support by hardware through v-to-p mappings (common to all processors), no software layers
 – Alt2: Hardware supports independent physical address spaces (each processor can access distinct physical areas)
  - Can provide SAS through the OS, in the system/user interface
  - v-to-p mappings only for data that are local
  - remote data accesses incur page faults; brought in via page fault handlers
  - same programming model, different hardware requirements and cost model

Naming and Operations (contd) (pag 56)

• Example: implementing Message Passing
 – Alt1: Direct support at hardware interface
  - But match and buffering benefit more from sw flexibility
 – Alt2: Support at sys/user interface or above in software (almost always)
  - Hardware interface provides basic data transport (well suited)
  - Send/receive built in sw for flexibility (protection, buffering)
  - Choices at user/system interface:
   - Alt2.1: OS involved each time: expensive
   - Alt2.2: OS sets up once/infrequently, then little sw involvement each time (setup with OS, execution with HW)
 – Alt2.3: Or lower interfaces provide SAS (virtual), and send/receive built on top with buffers and loads/stores (reads/writes + synchronization)
• Need to examine the issues and tradeoffs at every layer
 – Frequencies and types of operations, costs

Ordering (pag 57)

• Ordering: How processes see the order of other processes' references
 – SAS: this defines the semantics of SAS
 – Message passing: no assumptions on orders across processes except those imposed by send/receive pairs
• Ordering is very important and subtle
• Uniprocessors play tricks with orders to gain parallelism or locality
• These are more important in multiprocessors
• Need to understand which old tricks are valid, and learn new ones
• How programs behave, what they rely on, and hardware implications

1.3.3 Replication (pag 58)

• Very important for reducing data transfer/communication
 – Obs: communication = between processes (not equivalent to data transfer)
• Again, depends on naming model
• Uniprocessor: caches do it automatically
 – Reduce communication with memory
• Message Passing naming model at an interface
 – A receive replicates, giving a new name; subsequently use new name
 – Replication is explicit in software above that interface
• SAS naming model at an interface
 – A load brings in data transparently, so can replicate transparently
 – Hardware caches do this, e.g. in shared physical address space
 – OS can do it at page level in shared virtual address space, or objects
 – No explicit renaming, many copies for same name: coherence problem
  - in uniprocessors, "coherence" of copies is natural in memory hierarchy

1.3.4 Communication Performance (pag 59)

• Performance characteristics determine usage of operations at a layer
 – Programmer, compilers etc make choices based on this (avoid costly operations)
• Fundamentally, three characteristics:
 – Latency: time taken for an operation
 – Bandwidth: rate of performing operations
 – Cost: impact on execution time of program (cost = latency * number of operations)
• If one processor does one thing at a time: bandwidth is proportional to 1/latency
 – But actually more complex in modern systems
• Characteristics apply to overall operations, as well as individual components of a system, however small
• We'll focus on communication or data transfer across nodes

Simple Example (expl 1.2) (pag 60)

• Component performs an operation in 100 ns (latency)
 – Simple bandwidth: 10 Mops
 – Internally pipelined with depth 10 => bandwidth 100 Mops
 – Rate determined by slowest stage of pipeline, not overall latency
• Suppose application performs 100M operations (how many times the sequence executes). What is the cost?
 – op count * op latency gives 10 sec (upper bound): 100E6 * 100E-9 = 10 (pipeline cannot be used)
 – op count / peak op rate gives 1 sec (lower bound, if full use of the pipeline is possible)
  - assumes full overlap of latency with useful work, so just issue cost
 – if the application can do 50 ns of useful work (on average) before depending on the result of an op, the cost to the application is the remaining 50 ns of latency: 100E6 * 50E-9 = 5 sec
• Delivered bandwidth on application thus depends on initiation frequency

Linear Model of Data Transfer Latency (pag 60)

• Transfer time (n) = T0 + n/B
 – T0 = startup; n = bytes; B = bandwidth
 – Model useful for message passing (T0 = first-bit latency), memory access (T0 = access time), bus (T0 = arbitration), pipelined vector ops etc (T0 = pipeline fill)
• As n increases, bandwidth approaches the asymptotic rate B
 – BW(n) = n / T(n) = n / (T0 + n/B)
• How quickly it approaches depends on T0
 – Size needed for half bandwidth (half-power point): n_1/2 = T0 * B (see the errata in the textbook)
• But the linear model is not enough
 – When can the next transfer be initiated? Can cost be overlapped?
 – Need to know how the transfer is performed
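A tiny sketch of the model (the T0 and B values are assumed for illustration, not from the book):

```c
#include <stdio.h>

/* Linear transfer-time model: T(n) = T0 + n/B  =>  BW(n) = n / T(n).
   Half of the asymptotic bandwidth B is reached at n_1/2 = T0 * B. */
int main(void) {
    double T0 = 20e-6;        /* startup cost: 20 us (assumed) */
    double B  = 100e6;        /* asymptotic bandwidth: 100 MB/s (assumed) */
    double n_half = T0 * B;   /* half-power point in bytes */

    printf("n_1/2 = %.0f bytes\n", n_half);
    for (double n = 256; n <= 1 << 20; n *= 16) {
        double t  = T0 + n / B;
        double bw = n / t;
        printf("n = %8.0f B  time = %6.1f us  BW = %5.1f MB/s\n",
               n, t * 1e6, bw / 1e6);
    }
    return 0;
}
```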

Communication Cost Model (pag 61-63)

• CommTime per message = Overhead + Assist Occupancy + Network Delay + Size/Bandwidth + Contention
 – = o_v + o_c + l + n/B + T_c
• Each component along the way has occupancy and delay
 – Overall delay is sum of delays
 – Overall occupancy (1/bandwidth) is biggest of occupancies (the bottleneck)
 – The next data transfer can only start if the critical resources are free (assuming no buffers along the path)
• Overhead and assist occupancy may be f(n) or not
• CommCost = frequency * (CommTime - overlap)
• General model for data transfer: applies to cache misses too
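Plugging illustrative numbers into the model (all parameter values below are assumed, not from the book):

```c
#include <stdio.h>

/* CommTime(n) = ov + oc + l + n/B + Tc ;  CommCost = freq * (CommTime - overlap) */
int main(void) {
    double ov = 1e-6;      /* sender/receiver overhead (s), assumed */
    double oc = 0.5e-6;    /* communication assist occupancy (s), assumed */
    double l  = 2e-6;      /* network delay (s), assumed */
    double B  = 200e6;     /* bandwidth (bytes/s), assumed */
    double Tc = 0.2e-6;    /* contention time (s), assumed */
    double n  = 4096;      /* message size (bytes) */

    double comm_time = ov + oc + l + n / B + Tc;
    double freq      = 1e4;     /* messages per program run, assumed */
    double overlap   = 1e-6;    /* portion hidden behind useful work, assumed */
    double comm_cost = freq * (comm_time - overlap);

    printf("CommTime = %.2f us, CommCost = %.3f s\n", comm_time * 1e6, comm_cost);
    return 0;
}
```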

Summary of Design Issues

• Functional and performance issues apply at all layers
• Functional: Naming, operations and ordering
• Performance: Organization, latency, bandwidth, overhead, occupancy
• Replication and communication are deeply related
 – Management depends on naming model
• Goal of architects: design against frequency and type of operations that occur at the communication abstraction, constrained by tradeoffs from above or below
 – Hardware/software tradeoffs

Recap

• Parallel architecture is an important thread in the evolution of architecture
 – At all levels
 – Multiple processor level now in mainstream of computing
• Exotic designs have contributed much, but given way to convergence
 – Push of technology, cost and application performance
 – Basic processor-memory architecture is the same
 – Key architectural issue is in communication architecture
  - How communication is integrated into memory and I/O system on node
• Fundamental design issues
 – Functional: naming, operations, ordering
 – Performance: organization, replication, performance characteristics
• Design decisions driven by workload-driven evaluation
 – Integral part of the engineering focus

Outline for Rest of Class

• Understanding parallel programs as workloads
 – Much more variation, less consensus and greater impact than in sequential
 – What they look like in major programming models (Ch. 2)
 – Programming for performance: interactions with architecture (Ch. 3)
 – Methodologies for workload-driven architectural evaluation (Ch. 4)
• Cache-coherent multiprocessors with centralized shared memory
 – Basic logical design, tradeoffs, implications for software (Ch 5)
 – Physical design, deeper logical design issues, case studies (Ch 6)
• Scalable systems
 – Design for scalability and realizing programming models (Ch 7)
 – Hardware cache coherence with distributed memory (Ch 8)
 – Hardware-software tradeoffs for scalable coherent SAS (Ch 9)

Outline (contd.)

• Interconnection networks (Ch 10)
• Latency tolerance (Ch 11)
• Future directions (Ch 12)
• Overall: conceptual foundations and engineering issues across a broad range of scales of design, all of which are important

Top 500 in Jun/08 (first 5)

• The new No. 1 system, built by IBM for the U.S. Department of Energy's Los Alamos National Laboratory and named "Roadrunner" by LANL after the state bird of New Mexico, achieved a performance of 1.026 petaflop/s, becoming the first ever to reach this milestone. At the same time, Roadrunner is also one of the most energy efficient systems on the list.
• Blue Gene/L, with a performance of 478.2 teraflop/s, at DOE's Lawrence Livermore National Laboratory
• IBM BlueGene/P (450.3 teraflop/s) at Argonne National Laboratory
• Sun SunBlade x6420 "Ranger" system (326 teraflop/s) at the Texas Advanced Computing Center at the University of Texas – Austin
• The upgraded Cray XT4 "Jaguar" (205 teraflop/s) at DOE's Oak Ridge National Laboratory

Top 500 in Jul/07: projections (figure)

Top 500 in Jul/08 (figure)

Top 500 in Jun/09 (first 10)

1. Roadrunner – DOE/NNSA/LANL, United States – BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Voltaire Infiniband – IBM
2. Jaguar – Oak Ridge National Laboratory, United States – Cray XT5 QC 2.3 GHz – Cray Inc.
3. JUGENE – Forschungszentrum Juelich (FZJ), Germany – Blue Gene/P Solution – IBM
4. Pleiades – NASA/Ames Research Center/NAS, United States – SGI ICE 8200EX, Xeon QC 3.0/2.66 GHz – SGI
5. BlueGene/L – DOE/NNSA/LLNL, United States – eServer Blue Gene Solution – IBM
6. Kraken XT5 – National Institute for Computational Sciences/University of Tennessee, United States – Cray XT5 QC 2.3 GHz – Cray Inc.
7. Blue Gene/P Solution – Argonne National Laboratory, United States – IBM
8. Ranger – Texas Advanced Computing Center/Univ. of Texas, United States – SunBlade x6420, Opteron QC 2.3 GHz, Infiniband – Sun Microsystems
9. Dawn – DOE/NNSA/LLNL, United States – Blue Gene/P Solution – IBM
10. JUROPA – Forschungszentrum Juelich (FZJ), Germany – Sun Constellation, NovaScale R422-E2, Intel Xeon X5570 2.93 GHz, Sun M9/Mellanox QDR Infiniband/Partec Parastation – Bull SA

Top 500 in Jun/09 (figure)

Projections in Jun/09 (figure)

Top 500 in Jun/2010 (top 10)

1. Jaguar – Cray XT5-HE, Opteron Six Core 2.6 GHz
2. Nebulae – Dawning TC3600 Blade, Intel X5650, NVidia Tesla C2050 GPU (China)
3. Roadrunner – BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Voltaire Infiniband
4. Kraken XT5 – Cray XT5-HE, Opteron Six Core 2.6 GHz
5. JUGENE – Blue Gene/P Solution
6. Pleiades – SGI Altix ICE 8200EX/8400EX, Xeon HT QC 3.0/Xeon Westmere 2.93 GHz, Infiniband
7. Tianhe-1 – NUDT TH-1 Cluster, Xeon E5540/E5450, ATI Radeon HD 4870 2, Infiniband
8. BlueGene/L – eServer Blue Gene Solution
9. Intrepid – Blue Gene/P Solution
10. Red Sky – Sun Blade x6275, Xeon X55xx 2.93 GHz, Infiniband


Top 500 in Jun/2010 (figure)


Top 500 in Jun/2011 (top 10)

1. K computer – SPARC64 VIIIfx 2.0 GHz, Tofu interconnect – RIKEN Advanced Institute for Computational Science (AICS), Japan – Fujitsu
2. Tianhe-1A – NUDT TH MPP, X5670 2.93 GHz 6C, NVIDIA GPU, FT-1000 8C – National Supercomputing Center in Tianjin, China – NUDT
3. Jaguar – Cray XT5-HE, Opteron 6-core 2.6 GHz – DOE/SC/Oak Ridge National Laboratory, United States – Cray Inc.
4. Nebulae – Dawning TC3600 Blade, Intel X5650, NVidia Tesla C2050 GPU – National Supercomputing Centre in Shenzhen (NSCS), China – Dawning
5. TSUBAME 2.0 – HP ProLiant SL390s G7, Xeon 6C X5670, Nvidia GPU, Linux/Windows – GSIC Center, Tokyo Institute of Technology, Japan – NEC/HP
6. Cielo – Cray XE6, 8-core 2.4 GHz – DOE/NNSA/LANL/SNL, United States – Cray Inc.
7. Pleiades – SGI Altix ICE 8200EX/8400EX, Xeon HT QC 3.0/Xeon 5570/5670 2.93 GHz, Infiniband – NASA/Ames Research Center/NAS, United States – SGI
8. Hopper – Cray XE6, 12-core 2.1 GHz – DOE/SC/LBNL/NERSC, United States – Cray Inc.
9. Tera-100 – Bull bullx super-node S6010/S6030 – Commissariat a l'Energie Atomique (CEA), France – Bull
10. Roadrunner – BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Voltaire Infiniband – DOE/NNSA/LANL, United States – IBM


Top 500 in Jun/2011 (figure)


Projected Performance @ 2011 (chart)


TOP500 Jun/2012 (top 10)

1. Sequoia – IBM BlueGene/Q, Power BQC 16C 1.60 GHz, custom interconnect – DOE/NNSA/LLNL, United States
2. K computer – Fujitsu, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect – RIKEN Advanced Institute for Computational Science (AICS), Japan
3. Mira – IBM BlueGene/Q, Power BQC 16C 1.60 GHz, custom interconnect – DOE/SC/Argonne National Laboratory, United States
4. SuperMUC – IBM iDataPlex DX360M4, Xeon E5-2680 8C 2.70 GHz, Infiniband FDR – Leibniz Rechenzentrum, Germany
5. Tianhe-1A – NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050 – National Supercomputing Center in Tianjin, China
6. Jaguar – Cray XK6, Opteron 6274 16C 2.200 GHz, Cray Gemini interconnect, NVIDIA 2090 – DOE/SC/Oak Ridge National Laboratory, United States
7. Fermi – IBM BlueGene/Q, Power BQC 16C 1.60 GHz, custom interconnect – CINECA, Italy
8. JuQUEEN – IBM BlueGene/Q, Power BQC 16C 1.60 GHz, custom interconnect – Forschungszentrum Juelich (FZJ), Germany
9. Curie thin nodes – Bull Bullx B510, Xeon E5-2680 8C 2.700 GHz, Infiniband QDR – CEA/TGCC-GENCI, France
10. Nebulae – Dawning TC3600 Blade System, Xeon X5650 6C 2.66 GHz, Infiniband QDR, NVIDIA 2050 – National Supercomputing Centre in Shenzhen (NSCS), China


TOP500 2012 (figure)


TOP500 2012 – Highlights

• Sequoia, an IBM BlueGene/Q system, is the No. 1 system on the TOP500. It was first delivered to the Lawrence Livermore National Laboratory in 2011 and is now fully deployed with an impressive 16.32 petaflop/s on the Linpack benchmark using 1,572,864 cores. Sequoia is one of the most energy-efficient systems on the list, consuming a total of 7.89 MW.
• Fujitsu's "K Computer", installed at the RIKEN Advanced Institute for Computational Science (AICS) in Kobe, Japan, is now the No. 2 system on the TOP500 list with 10.51 Pflop/s on the Linpack benchmark using 705,024 SPARC64 processing cores.
• A second BlueGene/Q system (Mira), installed at Argonne National Laboratory, is now at No. 3 with 8.15 petaflop/s on the Linpack using 786,432 cores.
• The most powerful system in Europe and No. 4 on the list is SuperMUC, an IBM iDataPlex system with Intel Sandybridge processors installed at Leibniz Rechenzentrum in Germany.
• The Chinese Tianhe-1A system, the No. 1 on the TOP500 in November 2010, is now No. 5 with 2.57 Pflop/s Linpack performance.
• The largest U.S. system in the previous list, the upgraded Jaguar installed at the Oak Ridge National Laboratory, is holding on to the No. 6 spot with 1.94 Pflop/s Linpack performance.
• Roadrunner, the first system to break the petaflop barrier in June 2008, is now listed at No. 19.


TOP500 2012 – Highlights (contd.)

• There are 20 petaflop/s systems in the TOP500 list.
• Italy makes its first debut in the TOP10 with an IBM BlueGene/Q system installed at CINECA. The system is at position No. 7 in the list with 1.69 Pflop/s Linpack performance.
• IBM's BlueGene/Q is now the most popular system in the TOP10, with 4 entries including the No. 1 and No. 3.
• The two Chinese systems at No. 5 and No. 10 and the Japanese Tsubame 2.0 system at No. 14 all use NVIDIA GPUs to accelerate computation, and a total of 57 systems on the list use accelerator/co-processor technology.
• 57 systems use accelerators or co-processors (up from 39 six months ago); 52 of these use NVIDIA chips, two use Cell processors, two use ATI Radeon, and one new system uses Intel MIC technology.
• Intel continues to provide the processors for the largest share (74.2 percent) of TOP500 systems.
• Intel's Westmere processors increased their presence in the list with 246 systems (240 in 2011).
• Already 74.8 percent of the systems use processors with six or more cores.
• The number of systems installed in China decreased from 74 in the previous list to 68 in the current one. China still holds the No. 2 position as a user of HPC, ahead of Japan, UK, France, and Germany. Japan holds the No. 2 position in performance share.


TOP Green Jun/2012 (figure)


Models of Shared-Memory Multiprocessors

• Uniform Memory Access (UMA) model, or Symmetric Memory Processors (SMPs): processors with caches ($) reach the memory modules (M) and I/O controllers/devices through a single interconnect (bus, crossbar, or multistage network), so every memory access has about the same cost.
• Non-Uniform Memory Access (NUMA) model – distributed memory: each node couples a processor and cache with its own local memory (Mem), and the nodes are linked by an interconnection network; access to local memory is faster than access to a remote node's memory.
• Cache-Only Memory Architecture (COMA): each node holds a processor (P), a cache (C), and a cache directory (D), connected by a network; a node's memory behaves as a large cache, so data migrates to the nodes that use it.
(Figure legend: P = processor, M/Mem = memory, C/$ = cache, D = cache directory; interconnect = bus, crossbar, or multistage network.)
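To make the shared-address-space view behind all three models concrete, here is a minimal sketch (mine, not from the original slides) in plain C with POSIX threads: two threads cooperate purely through ordinary loads and stores to shared data, with no explicit messages. On a UMA/SMP machine every access to the shared array costs roughly the same; on a NUMA machine the cost would depend on which node's memory holds the pages. The file name and compile line are illustrative only.

/* sas_sketch.c -- two threads summing halves of a shared array.
 * Build (assumed): gcc -O2 -pthread sas_sketch.c -o sas_sketch
 */
#include <pthread.h>
#include <stdio.h>

#define N (1 << 20)

static double data[N];      /* shared array: visible to every thread        */
static double partial[2];   /* one result slot per thread, so no data race  */

static void *worker(void *arg) {
    long id = (long)arg;
    double s = 0.0;
    /* Each thread reads its half of the shared array with ordinary loads;
     * communication is implicit in the shared address space.              */
    for (long i = id * (N / 2); i < (id + 1) * (N / 2); i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++)
        data[i] = 1.0;

    pthread_t t[2];
    for (long id = 0; id < 2; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (long id = 0; id < 2; id++)
        pthread_join(t[id], NULL);

    printf("sum = %.0f\n", partial[0] + partial[1]);   /* expect 1048576 */
    return 0;
}

On a NUMA machine the same program runs unchanged (that is the point of the abstraction), but placement starts to matter: pages first touched by one thread may end up in that thread's node, making the other thread's loads remote and slower.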


Performance flatlining (figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith)


Predicted decline in clock-rate growth (figure)


Predicted decline in clock-rate growth, ITRS (figure)


Changing Conventional Wisdom  (Source: Doug L Hoffman, Patterson 2008)

• Old CW: Power is free, transistors are expensive
  New CW: "Power wall" – power is expensive, transistors are free (we can put more on a chip than we can afford to turn on)
• Old CW: Sufficiently increasing instruction-level parallelism via compilers and innovation (out-of-order, speculation, VLIW, ...)
  New CW: "ILP wall" – law of diminishing returns on more hardware for ILP
• Old CW: Multiplies are slow, memory access is fast
  New CW: "Memory wall" – memory is slow, multiplies are fast (~200 clock cycles to DRAM memory vs. ~4 clocks for a multiply; a small measurement sketch follows below)
• Old CW: Uniprocessor performance 2X / 1.5 yrs
  New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall; uniprocessor performance now 2X / 5(?) yrs
• Sea change in chip design: multiple "cores" (2X processors per chip / ~2 years)
  – More, simpler processors are more power efficient
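As a rough illustration of the memory-wall numbers quoted above, the following sketch (mine, not from the original slides; plain C on a Unix-like system, built with something like gcc -O2, older glibc may also need -lrt) times a chain of dependent multiplies against a chain of dependent, cache-missing loads. Exact figures vary by machine, but the load chain typically costs one to two orders of magnitude more per step, which is the point of the "~200 cycles to DRAM vs. ~4 cycles per multiply" contrast.

/* memwall_sketch.c -- dependent multiplies vs. dependent cache-missing loads.
 * Assumes RAND_MAX > N (true for glibc).
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N ((size_t)1 << 24)             /* 16M entries * 8 B = 128 MB */

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    /* Sattolo's algorithm builds one random cycle over all N entries, so
     * every load depends on the previous one and prefetchers see no stride. */
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    volatile double x = 1.000001;       /* volatile keeps the loop honest */
    double y = 1.0;
    double t0 = now();
    for (size_t i = 0; i < N; i++) y *= x;          /* dependent multiplies */
    double t1 = now();

    size_t p = 0;
    for (size_t i = 0; i < N; i++) p = next[p];     /* dependent DRAM misses */
    double t2 = now();

    printf("multiply chain: %.2f ns/op   load chain: %.2f ns/op  (%zu %g)\n",
           (t1 - t0) * 1e9 / N, (t2 - t1) * 1e9 / N, p, y);
    free(next);
    return 0;
}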


Sea Change in Chip Design

• Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm² chip
• RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm² chip
• A 125 mm² chip in 0.065 micron CMOS = 2312 RISC II + FPU + Icache + Dcache
  – RISC II shrinks to ~0.02 mm² at 65 nm
  – Caches via DRAM or 1-transistor SRAM (www.t-ram.com)?
  – Proximity Communication via capacitive coupling at > 1 TB/s? (Ivan Sutherland @ Sun / Berkeley)
• Processor is the new transistor?
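A back-of-the-envelope check of the shrink claimed above (my own arithmetic, not from the slides), scaling the 3-micron, 60 mm² RISC II die linearly to a 65 nm (0.065 micron) process:

\[
\frac{3\,\mu\mathrm{m}}{0.065\,\mu\mathrm{m}} \approx 46
\quad\Rightarrow\quad
\text{area shrink} \approx 46^{2} \approx 2100
\quad\Rightarrow\quad
\frac{60\ \mathrm{mm}^{2}}{2100} \approx 0.03\ \mathrm{mm}^{2},
\]

the same order of magnitude as the slide's ~0.02 mm²; the residual gap simply reflects that real layouts do not scale exactly with the drawn feature size.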
