Parallel Panorama

Parallel computation appears to be a straightforward idea, but it has not turned out to be as easy as everyone initially thinks. Today, an overview of successes and failures.

Amdahl’s Law

• Parallel computation has limited benefit ...
  • A computation taking time T, (x-1)/x of which can be parallelized, never runs faster than T/x
  • Let T be 24 hours, let x = 10
  • 9/10 can be parallelized, 1/10 cannot
  • Suppose the parallel part runs in 0 time: 2.4 hrs
• Amdahl’s Law predates most parallel efforts ... why pursue parallelism?
  • New algorithms
  • Why would one want to preserve legacy code?
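A compact restatement of the bound the slide works through (the standard Amdahl formulation; the processor count p is added notation, not the slide’s):

    \[
      S(p) \;=\; \frac{T}{\tfrac{T}{x} \;+\; \tfrac{x-1}{x}\cdot\tfrac{T}{p}} \;\le\; x
    \]

With T = 24 hours and x = 10, the serial tenth alone takes 24/10 = 2.4 hours, so no number of processors can finish sooner and the speedup never exceeds 10.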


Early High Performance Machines

• High Performance computing has always implied the use of pipelining
  • IBM Stretch, S/360 Model 91, Cray 1, Cyber 205
• Pipelining breaks operations into small steps performed in “assembly line” fashion
• The size t of the longest step determines the rate
• Operations are started every t time units (a timing note follows after the next slide)
• The most common application of pipelining became “vector instructions” in which operations could be applied to all elements of a vector
• Pipelining is used extensively in processor design, and parallelism is a more effective way to achieve high perf

Early Parallel Machines

• The first successful parallel computer was Illiac IV, built at the University of Illinois
• 64 processors (1/4 of the original design) built
• Constructed in the pre-LSI days, hardware was both expensive and large
• A SIMD computer with a few registers per node
• Though it was tough to program, NASA used it
• Flynn’s Taxonomy: SIMD -- single instruction, multiple data; MIMD -- multiple instructions, multiple data. Related term: SPMD -- single program, multiple data
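The timing note on the pipelining rate (a back-of-the-envelope model; the stage count k is added notation, not the slide’s): with k stages each taking at most t, a new operation starts every t units, so for n operations

    \[
      T_{\text{pipelined}} \;\approx\; (k + n - 1)\,t
      \qquad\text{vs.}\qquad
      T_{\text{unpipelined}} \;\approx\; n\,k\,t,
    \]

approaching a k-fold throughput gain once n is large.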


VLSI Revolution Aided

• Price/density advances in Si => multiprocessor computers were feasible
• SIMD computers continued to reign for technical reasons
  • Memory was still relatively expensive
  • Logic was still not dense enough for a high performance node
  • It’s how most architects were thinking
• Ken Batcher developed the Massively Parallel Processor (MPP) for NASA with 16K procs
• Danny Hillis built two machines, CM-1 and CM-2, scaling to 64K
• MASPAR also sold a successful SIMD machine

SIMD Is Simply Too Rigid

• SIMD architectures have two advantages over MIMD architectures
  • There is no program memory -- smaller footprint
  • It is possible to synchronize very fast ... like on the next instruction
• SIMD has liabilities, too ...
  • Performance: if a>0 then ... else ...
  • Processor model is the virtual processor model, and though there are more processors than on a typical MIMD machine, there is never 1 pt/proc
  • Ancillary: instr. distribution limits clock, hard to share, single pt of failure, etc.
• SIMD not a contender
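To make the “if a>0 then ... else ...” liability concrete, here is a sketch of the usual SIMD execution model (illustrative, not from the original slides; the arm functions f and g and the 4-element array are placeholders). Every processing element steps through both arms in lockstep and a mask selects which result commits, so the branch costs roughly the sum of both arms:

    /* Each loop iteration stands for one PE acting in lockstep. */
    #include <stdio.h>

    static int f(int a) { return a * 2; }     /* "then" arm */
    static int g(int a) { return -a; }        /* "else" arm */

    int main(void) {
        int a[4] = {3, -1, 0, 7};
        int out[4];
        for (int i = 0; i < 4; i++) {
            int mask = (a[i] > 0);
            int t_r  = f(a[i]);               /* every PE executes the then-arm */
            int e_r  = g(a[i]);               /* every PE executes the else-arm */
            out[i]   = mask ? t_r : e_r;      /* the mask selects what commits */
        }
        for (int i = 0; i < 4; i++)
            printf("%d ", out[i]);            /* prints: 6 1 0 14 */
        printf("\n");
        return 0;
    }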

Denelcor HEP

• Designed and built by Burton Smith
• Allowed multiple instructions to “be in the air” at one time
  • Fetch next instruction, update PC
  • Suspend, check to see if others need attention
  • Decode instruction, computing EA of mem ref(s), issue mem ref(s)
  • Suspend, check to see if others need attention
  • Evaluate instruction
• Great for multithreading, or for good ILP

Effects of VLSI on Early MIMD Machines

• Single chip processors spawned a flurry of design in the nonshared memory domain
  • ZMOB, ring of 30, 8-bit procs, Maryland
  • Pringle, 8x8 config mesh, 8-bit procs, Purdue CS
  • Cosmic Cube, 6-cube, 16-bit procs, Caltech -- Intel commercialized this as iPSC, iPSC/2
• Quickly, 32-bit MIMD machines arrived
  • Sequent sold an elegant shared bus machine
  • BBN Butterfly was a large shared memory machine
  • Two different approaches to coherency


PRAM Machine Model

• Parallel Random Access Machine ... simplest generalization to a parallel machine
• Synchronous, “unit-cost” shared memory -- it is unrealistic but it drove intensive research
• Theoretically interesting as a way to discover the limits of parallelism (a tiny example appears after the next slide)
• Varieties: CRCW, EREW, CREW, ...

[Figure: PRAM -- a row of processors P all accessing a single shared Memory at unit cost]

Research Parallel Machines

• University of Illinois developed the Cedar machine, a 32 processor shared memory -- a dance hall architecture

[Figure: Cedar’s dance hall organization -- processors P on one side of an Interconnection Network, memory modules M on the other]
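A tiny example of the PRAM style of reasoning (illustrative, not from the original slides): summing n values in O(log n) synchronous unit-cost steps. The outer loop is one PRAM time step; the inner loop stands for the processors acting simultaneously, and no location is read or written by two processors in the same step (EREW):

    #include <stdio.h>

    int main(void) {
        int x[8] = {3, 1, 4, 1, 5, 9, 2, 6};   /* one value per processor */
        int n = 8;
        for (int stride = 1; stride < n; stride *= 2)    /* one PRAM step */
            for (int i = 0; i + stride < n; i += 2 * stride)
                x[i] += x[i + stride];                   /* processor i acts */
        printf("sum = %d\n", x[0]);                      /* prints sum = 31 */
        return 0;
    }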


Ultracomputer and RP-3

• The Ultracomputer developed at NYU had a cute idea in the interconnection network: a combining switch
• fetch_and_add A,V: return the value at location A, add in V and update it (sketched in code after the next slide)
• The theory was that rather than avoiding bad operations, do them with impunity ... there was even a programming methodology based on it
• Unfortunately, combining doesn’t work beyond 64 processors, based on both analysis and experimentation

Combining Switch

• Combining switch design solves problems like “bank conflicts”, busy waiting on semaphores...

[Figure: processors issuing fetch_and_add requests that are combined in the switch stages on their way to the memory modules]
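A software rendering of fetch_and_add (a sketch using C11 atomics, not the Ultracomputer hardware; the slot-claiming use case is illustrative): each caller atomically receives the old value of A and adds V, so concurrent requesters obtain distinct values without locks -- exactly the traffic a combining switch is meant to serve at scale.

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int next_slot = 0;        /* the shared location "A" */

    /* fetch_and_add A,V: return the old value of *A, then add V to it. */
    int fetch_and_add(atomic_int *A, int V) {
        return atomic_fetch_add(A, V);
    }

    int main(void) {
        /* Two "processors" each claim one slot; with real threads the calls
           could interleave arbitrarily and would still yield 0 and 1. */
        int p1 = fetch_and_add(&next_slot, 1);
        int p2 = fetch_and_add(&next_slot, 1);
        printf("P1 got slot %d, P2 got slot %d\n", p1, p2);
        return 0;
    }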

Early Programming Approaches

• Every machine has powerful “features” that the programmer should exploit
• Program “close” to the machine
• Parallelizing FORTRAN compilers will be along shortly
• This viewpoint was nearly disastrous ...

Memory and MIMD Computers

• It was easier to envision an MIMD machine than to build one ... the problem is memory
• We are accustomed to a flat, global memory
• Memory Coherency: P1 reads a into its cache, P3 reads a into its cache, P1 updates a to 5 with write-through -- P3 has stale data -- writeback is worse

[Figure: P1, P2, P3 with caches above a shared memory; after P1’s write-through, memory and P1 hold a = 5 while P3 still holds a = 4]
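The stale-data trace above, replayed with plain variables standing in for the memory and two write-through caches (purely illustrative; real coherence is a hardware protocol, not user code):

    #include <stdio.h>

    int main(void) {
        int memory_a   = 4;
        int p1_cache_a = memory_a;    /* P1 reads a into its cache */
        int p3_cache_a = memory_a;    /* P3 reads a into its cache */

        p1_cache_a = 5;               /* P1 updates a ...                  */
        memory_a   = p1_cache_a;      /* ... write-through updates memory, */
                                      /* but nothing updates P3's copy     */

        printf("memory=%d P1=%d P3=%d\n", memory_a, p1_cache_a, p3_cache_a);
        /* prints memory=5 P1=5 P3=4 -- P3 has stale data */
        return 0;
    }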


Nonshared Memory Avoids Problem

• No shared memory implies no coherency prob
• Puts a greater burden on the programmer, sw
  (Conventional Wisdom: It’s harder to program nonshared memory)
• Clearly, there must be a program, accessible locally to the processor, i.e. MIMD
• There must be local data, stack, etc.
• These are referenced frequently and must be fast
• A mid-point design is a non-coherent shared address space
  • If every processor has the same amount of memory, the xth power of 2, then interpret the address bits left of x as the processor number (decoded in a sketch after the next slide)
  • P0 has the low addresses, P1 the next lowest ...

Shared vs Nonshared Memory

• The memory organization is crucial in parallel computing. That is an incontrovertible fact
• Has engendered many wars, but one wonders why
• Programmers should not have to know or care, because a decent programming language can give a global view of a computation without any mention of memory organization
• Many languages give a global view
• ZPL’s accomplishment is that it maps easily onto any memory organization
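A sketch of that address interpretation (illustrative code; the choice x = 20, i.e. 1 MB per processor, is an assumption for the example): the bits left of x name the processor and the low x bits are the local offset.

    #include <stdio.h>

    #define X 20   /* log2 of per-processor memory size (illustrative: 1 MB each) */

    /* Split a global address into (processor number, local offset). */
    void decode(unsigned long global_addr, unsigned long *proc, unsigned long *offset) {
        *proc   = global_addr >> X;                  /* bits left of x: processor number */
        *offset = global_addr & ((1ul << X) - 1);    /* low x bits: local offset */
    }

    int main(void) {
        unsigned long proc, offset;
        decode(3ul * (1ul << X) + 12, &proc, &offset);
        printf("processor %lu, offset %lu\n", proc, offset);   /* processor 3, offset 12 */
        return 0;
    }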

Another Round of Architecture

• The late 80’s was a rich period of machine design
• Driving force was the VLSI revolution
• Back pressure came from the “killer micros”
• Architecture design focused on “problems”
• The problem the architect thought was the most pressing varied by the background of the architect
• Examples were low latency communication, fast synchronization, coherent shared memory, ...

iWARP

• iWARP (HT Kung, CMU/Intel) -- fast communication and synchronization; ideal for Cannon’s algorithm
• Culmination of several research projects: PSC and WARP

J-Machine

• J-Machine (Bill Dally, MIT/Intel) -- fast communication, large scalability, 3D mesh
• 3D design scales in physical space with minimum path lengths

KSR

• Kendall Square Research -- cache-only machine allowing data to move to where it is needed; ring communication structure
• Raised many interesting technical questions ...


DASH

• DASH (John Hennessy, Stanford/SGI) -- true distributed shared memory
• Significant research project with impact in many areas
• ... will study ideas of design

Recent Commercial Machines

A variety of large machines have been deployed in recent years
• IBM SP-2, nonshared memory
• CM-5, nonshared memory
• Cray T3D and T3E, shared address space
• SGI Origin 2000, coherent shared memory
• Will the rich architectural diversity of the past continue, or will all parallel machines finally look alike?


Programming Approaches

• “Automatic Parallelization” of Fortran
  • Preserves the investment in the “legacy codes”
  • Replaces programmer effort with compiler effort
  • Should be as successful as vectorization
  • Has been demonstrated to achieve a 0.9-10 fold speedup, with 2.5 being typical -- Amdahl’s Law
• Alternative Languages
  • Functional -- SISAL, Id Nouveau, etc.
  • Logic -- Prolog
• These approaches failed ... the only successful programmers have programmed close to their machine; vectorized Cray Fortran continued to reign

Next Strategy to Save Legacy Codes ...

• As automatic parallelization foundered, adding “directives” or extending existing languages with new constructs came into fashion
• Annotating a program can take as much (more?) effort than rewriting
• HPF, HPC, HPC++, CC++, pC++, Split C, etc.
• Extending an existing language requires that its semantics be preserved; parallel extensions are at odds with sequential semantics
• Approach has failed ... programs are not easily transformed: parallelism => paradigm shift from sequential

Message Passing
The prevailing technique

• Message passing libraries provide standardized facilities for programmers to define a parallel implementation of their computation
• Uses existing languages, Fortran, C, etc. => save legacy
• Interface is standard across machines
• Lowest common denominator ... works on shared and nonshared memory machines
• MP programming is difficult, subtle, error-prone; the programmer implements the paradigm shift
• Message passing embeds machine assumptions in code; not very portable
• As many as a dozen mp libs were proposed; PVM and MPI are the only contenders (a minimal MPI example appears after the next slide)

State of Parallel Computing

• Many companies thought parallel computing was easy ... They’re gone now ...
• SGI/Cray, IBM, HP, Sun make serious parallel computers
• Seattle’s Tera Computer Inc struggles to introduce a new parallel architecture, MTA
• The basic reality of large computers has changed: servers drive the market, not speed-freaks
• The DoE’s ASCI program pushes the envelope
• SMP’s are ubiquitous
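The minimal MPI example promised above (MPI itself is the real, standard interface named on the slide; this particular program is an illustrative sketch): rank 0 sends one integer to rank 1. Compile with mpicc and run under 2 processes. Even this tiny exchange shows the bookkeeping -- ranks, tags, communicators -- behind the “difficult, subtle, error-prone” remark.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1, tag 0 */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }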


Budget Parallelism

• “Rolling your own” parallel computer with workstations on a network (Beowulf) is popular
• This is simple and cost effective, and the machines can be used as workstations during business hours
• What are the impediments?
  • Nonshared memory, nonshared address space
  • Must be programmed w/ msg passing or ZPL
• As an incubator for new applications, Beowulfs do not promote the use of shared memory
• Everything in parallelism seems to come down to programming

Applications

• Traditionally, NASA, DoD, DoE labs and their contractors have been big iron users
  • CFD
  • Structural
  • “Bomb Codes”
  (The ability to run legacy code, dusty decks, can be significant)
• A huge early success was Shell Oil’s seismic modelling code developed by Don Heller -- parallelism that made money
• IBM did circuit simulation on a custom SIMD
• Many claims were made but actual working parallelism was rare in the 80s


Government Programs

Over the years numerous efforts by funding agencies have tried to jump-start high performance parallel computing
• DARPA, NSF, ONR, AFOSR, DoE, NASA, ...
• Some have been criticized as corporate welfare
• Initial thrust was on hardware
Companies have invested heavily, too
The most significant federal effort was the High Performance Computing and Communications Initiative (HPCC) in the early ’90s

HPCC

Took on the “whole” problem by considering hw, sw and applications involving “real” users
• Compared to predecessors, it was well planned
• Attempt at interagency coordination
• “Real” users with science or engineering apps had to collaborate with “real” computer scientists
• Introduced the concept of a “grand challenge” problem, a computation which, if it received real performance, would cause better science to be done

Grand Challenge Problems

Booklets with snappy graphics enumerate these
• Example classes
  • Galaxy simulation
  • Molecular dynamics
  • Climate modelling
  • Protein folding
  • CFD
  • Circuit simulation
  • Structural analysis
  • Seismic modelling
  • ... and many, many variations

HPCC Legacy

Ironically, the HPCC initiative left many “real” scientists and engineers with the view that computation is a third pillar in research, along with theory and experimentation

It also convinced most people of the obvious: “It’s the Economy, Stupid” ... except here it’s the SOFTWARE

As things stand, most users are writing message passing programs at considerable effort ...

HPCC’s main error -- it promised success in 5 years
