Parallel Panorama

Parallel computation appears to be a straightforward idea, but it has not turned out to be as easy as everyone initially thinks. Today, an overview of successes and failures.

Amdahl's Law
Parallel computation has limited benefit ...
• A computation taking time T, (x-1)/x of which can be parallelized, never runs faster than T/x
• Let T be 24 hours, let x = 10
• 9/10 can be parallelized, 1/10 cannot
• Suppose the parallel part runs in 0 time: 2.4 hrs (the arithmetic is sketched below)
• Amdahl's Law predates most parallel efforts ... so why pursue parallelism?
• New algorithms
• Why would one want to preserve legacy code?
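To make the slide's arithmetic concrete, here is a minimal C sketch of the 24-hour, x = 10 example; the speedup formula is the standard Amdahl expression, and the processor counts beyond the slide's example are illustrative only.

    #include <stdio.h>

    /* Amdahl's Law: if 1/x of a job is serial, the job never runs
     * faster than T/x, no matter how many processors are used.        */
    double best_case_time(double T, double x) {
        return T / x;      /* serial part alone, parallel part in 0 time */
    }

    double speedup(double serial_fraction, double nprocs) {
        /* normalized time = serial + parallel/nprocs, with T = 1 */
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nprocs);
    }

    int main(void) {
        double T = 24.0, x = 10.0;   /* the slide's example: 24-hour job, 1/10 serial */
        printf("lower bound on runtime: %.1f hours\n", best_case_time(T, x)); /* 2.4  */
        printf("speedup on 10 procs:    %.2f\n", speedup(1.0 / x, 10.0));     /* ~5.3 */
        printf("speedup on 1000 procs:  %.2f\n", speedup(1.0 / x, 1000.0));   /* ~9.9 */
        return 0;
    }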
Early High Performance Machines
• High performance computing has always implied the use of pipelining
• IBM Stretch, S/360 Model 91, Cray 1, Cyber 205
• Pipelining breaks operations into small steps performed in "assembly line" fashion
• The size t of the longest step determines the rate
• Operations are started every t time units (a worked example follows these two slides)
• The most common application of pipelining became "vector instructions" in which operations could be applied to all elements of a vector
• Pipelining is used extensively in processor design, and parallelism is a more effective way to achieve high perf

Early Parallel Machines
• The first successful parallel computer was Illiac IV, built at the University of Illinois
• 64 processors (1/4 of the original design) were built
• Constructed in the pre-LSI days, hardware was both expensive and large
• A SIMD computer with a shared memory, few registers per node
• Though it was tough to program, NASA used it
• Flynn's Taxonomy: SIMD -- single instruction, multiple data; MIMD -- multiple instructions, multiple data
• Related term: SPMD -- single program, multiple data
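The pipelining rate argument can be made concrete with a little arithmetic; in this sketch the stage count, stage time, and vector length are invented numbers, not figures from the slide.

    #include <stdio.h>

    /* A k-stage pipeline starts one operation every t time units once it is
     * full, so n operations take (k + n - 1) * t instead of n * k * t.      */
    int main(void) {
        int    k = 5;        /* pipeline stages (assumed)                  */
        double t = 10.0;     /* time of the longest stage, in ns (assumed) */
        int    n = 64;       /* elements of a vector operation (assumed)   */

        double unpipelined = (double)n * k * t;
        double pipelined   = (double)(k + n - 1) * t;

        printf("unpipelined: %.0f ns\n", unpipelined);   /* 3200 ns */
        printf("pipelined:   %.0f ns\n", pipelined);     /*  680 ns */
        printf("speedup:     %.2f\n", unpipelined / pipelined);
        return 0;
    }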
SIMD Is Simply Too Rigid
• SIMD architectures have two advantages over MIMD architectures
• There is no program memory -- smaller footprint
• It is possible to synchronize very fast ... like on the next instruction
• SIMD has liabilities, too ...
• Performance: if a>0 then ... else ... (see the sketch following these two slides)
• Processor model is the virtual processor model, and though there are more processors than on a typical MIMD machine, there is never 1 pt/proc
• Ancillary: instr. distribution limits clock, hard to share, single pt of failure, etc.
• SIMD is not a contender

VLSI Revolution Aided Parallel Computing
• Price/density advances in Si => multiprocessor computers were feasible
• SIMD computers continued to reign for technical reasons
• Memory was still relatively expensive
• Logic was still not dense enough for a high performance node
• It's how most architects were thinking
• Ken Batcher developed the Massively Parallel Processor (MPP) for NASA with 16K procs
• Danny Hillis built two machines, CM-1 and CM-2, scaling to 64K
• MASPAR also sold a successful SIMD machine
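The "if a>0 then ... else ..." liability on the SIMD slide is that every processor steps through both branches in lockstep, masked off when a branch does not apply, so the cost is the sum of the two branch times rather than the time of whichever branch is taken. Below is a minimal C sketch of that masked, lockstep execution; the data and branch bodies are invented for illustration.

    #include <stdio.h>

    #define N 8   /* pretend N virtual processors, one array element each */

    int main(void) {
        double a[N] = { 3, -1, 4, -1, 5, -9, 2, -6 };   /* made-up data */
        double r[N];

        /* SIMD-style execution: ALL processors step through the "then" arm,
         * with those where a[i] <= 0 masked off ...                         */
        for (int i = 0; i < N; i++)
            if (a[i] > 0) r[i] = a[i] * 2.0;

        /* ... and then ALL processors step through the "else" arm with the
         * complementary mask.  Both arms cost time on every processor.      */
        for (int i = 0; i < N; i++)
            if (!(a[i] > 0)) r[i] = -a[i];

        for (int i = 0; i < N; i++) printf("%g ", r[i]);
        printf("\n");
        return 0;
    }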
Denelcor HEP
• Designed and built by Burton Smith
• Allowed multiple instructions to "be in the air" at one time
• Fetch next instruction, update PC
• Suspend, check to see if others need attention
• Decode instruction, computing EA of mem ref(s), issue mem ref(s)
• Suspend, check to see if others need attention
• Evaluate instruction
• Great for multithreading, or for good ILP

Effects of VLSI on Early MIMD Machines
• Single-chip processors spawned a flurry of design in the nonshared memory domain
• ZMOB, ring of 30, 8-bit procs, Maryland
• Pringle, 8x8 config mesh, 8-bit procs, Purdue CS
• Cosmic Cube, 6-cube, 16-bit procs, Caltech -- Intel commercialized this as iPSC, iPSC/2
• Quickly, 32-bit MIMD machines arrived
• Sequent sold an elegant shared bus machine
• BBN Butterfly was a large shared memory machine
• Two different approaches to coherency
PRAM Machine Model
• Parallel Random Access Machine ... simplest generalization to a parallel machine
• Synchronous, "unit-cost" shared memory -- it is unrealistic but it drove intensive research
• Theoretically interesting as a way to discover the limits of parallelism (see the sketch below)
• Varieties: CRCW, EREW, CREW, ...
[Figure: processors P connected directly to a shared Memory]

Research Parallel Machines
• University of Illinois developed the Cedar machine, a 32 processor shared memory -- a dance hall architecture
[Figure: processors P connected through an Interconnection Network to memory modules M]
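As an example of the kind of limits-of-parallelism reasoning the PRAM supports: summing n values takes log2(n) synchronous steps, one addition per processor per step. The C below simulates those synchronous steps sequentially; the data are invented, and the point is only the step count, not the syntax of any real PRAM notation.

    #include <stdio.h>

    #define N 8   /* number of values; think of one virtual PRAM processor per value */

    int main(void) {
        int x[N] = { 5, 1, 7, 2, 9, 4, 8, 3 };   /* made-up input */

        /* Each pass of the outer loop is one synchronous, unit-cost PRAM step:
         * all the additions inside it are imagined to happen at the same time. */
        for (int stride = 1; stride < N; stride *= 2)
            for (int i = 0; i + stride < N; i += 2 * stride)
                x[i] += x[i + stride];

        printf("sum = %d after log2(%d) = 3 steps\n", x[0], N);   /* 39 */
        return 0;
    }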
Ultracomputer and RP-3
• The Ultracomputer developed at NYU had a cute idea in the interconnection network: a combining switch
• fetch_and_add A,V: return location A, add in V and update (sketched below)
• The theory was that rather than avoiding bad operations, do them with impunity ... there was even a programming methodology based on it
• Unfortunately, combining doesn't work beyond 64 processors, based on both analysis and experimentation

Combining Switch
• Combining switch design solves problems like "bank conflicts", busy waiting on semaphores...
[Figure: combining switch network connecting processors P to memory modules M; concurrent fetch_and_add requests are combined on their way through the switch]
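The fetch_and_add A,V primitive returns the old value of A and adds V into it atomically, so many processors can claim distinct values of a shared counter without locks. The sketch below expresses only that semantics, using C11 atomics on a single node; the Ultracomputer's contribution was doing the combining in the network hardware, which single-node code cannot show.

    #include <stdio.h>
    #include <stdatomic.h>

    atomic_int A = 0;          /* the shared location "A" */

    /* fetch_and_add A,V: return the current value of A and add V into it. */
    int fetch_and_add(atomic_int *a, int v) {
        return atomic_fetch_add(a, v);
    }

    int main(void) {
        /* Four "processors" each claim one slot of a shared counter; each
         * gets a distinct old value, with no locks and no busy waiting.    */
        for (int p = 0; p < 4; p++) {
            int slot = fetch_and_add(&A, 1);
            printf("processor %d got slot %d\n", p, slot);
        }
        printf("A is now %d\n", (int)atomic_load(&A));
        return 0;
    }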
Early Programming Approaches
• Every machine has powerful "features" that the programmer should exploit
• Program "close" to the machine
• Parallelizing FORTRAN compilers will be along shortly
• This viewpoint was nearly disastrous ...

Memory and MIMD Computers
• It was easier to envision an MIMD machine than to build one ... the problem is memory
• We are accustomed to a flat, global memory
• Memory Coherency: P1 reads a into cache; P3 reads a into cache; P1 updates a to 5, write-through -- P3 has stale data -- Writeback is worse (see the sketch below)
[Figure: P1, P2, P3 with private caches over a shared memory; a goes from 4 to 5 in memory and in P1's cache, but P3's cache still holds 4]
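A small C sketch that just walks through the slide's scenario with per-processor "cache" variables, to make the stale read concrete; the values 4 and 5 come from the slide, but the structs are an illustration of the problem, not of how real hardware or programs behave.

    #include <stdio.h>

    int memory_a = 4;                     /* the shared location a */

    typedef struct { int a; int valid; } cache_t;

    int main(void) {
        cache_t p1 = {0, 0}, p3 = {0, 0};

        p1.a = memory_a; p1.valid = 1;    /* P1 reads a into its cache (4)    */
        p3.a = memory_a; p3.valid = 1;    /* P3 reads a into its cache (4)    */

        p1.a = 5; memory_a = 5;           /* P1 updates a to 5, write-through */

        /* P3 hits in its own cache and never sees the update: stale data.
         * With write-back, even memory would still hold 4 at this point.     */
        printf("memory: %d,  P3 sees: %d\n", memory_a, p3.a);   /* 5 vs 4 */
        return 0;
    }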
Nonshared Memory Avoids Problem
• No shared memory implies no coherency problem
• Puts a greater burden on the programmer, sw
• Clearly, there must be a program, accessible locally to the processor, i.e. MIMD
• There must be local data, stack, etc.
• These are referenced frequently and must be fast
• A mid-point design is a non-coherent shared address space
• If every processor has the same amount of memory, the xth power of 2, then interpret the address bits left of x as the processor number (sketched below)
• P0 has the low addresses, P1 the next lowest ...

Shared vs Nonshared Memory
• The memory organization is crucial in parallel computing. That is an incontrovertible fact
• Has engendered many wars, but one wonders why
• Conventional Wisdom: It's harder to program nonshared memory
• Programmers should not have to know or care, because a decent programming language can give a global view of a computation without any mention of memory organization
• Many languages give a global view
• ZPL's accomplishment is that it maps easily onto any memory organization
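A sketch of the address interpretation described on the nonshared-memory slide: with 2^x bytes per processor, the bits above bit x name the processor and the low x bits are the offset within that processor's memory. The value x = 20 (1 MB per processor) and the example address are invented for illustration.

    #include <stdio.h>
    #include <stdint.h>

    /* Non-coherent shared address space: every processor owns 2^X bytes;
     * the bits above bit X select the processor, the low X bits are the
     * offset within that processor's memory.                              */
    #define X 20                                /* assumed: 1 MB per processor */

    static unsigned proc_of(uint64_t addr)   { return (unsigned)(addr >> X); }
    static uint64_t offset_of(uint64_t addr) { return addr & ((1ULL << X) - 1); }

    int main(void) {
        uint64_t addr = 0x34F1234;              /* an arbitrary global address */
        printf("address 0x%llx -> processor %u, offset 0x%llx\n",
               (unsigned long long)addr, proc_of(addr),
               (unsigned long long)offset_of(addr));
        /* P0 owns the lowest 2^X addresses, P1 the next lowest, and so on. */
        return 0;
    }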
Another Round of Architecture
• The late 80's was a rich period of machine design
• Driving force was the VLSI revolution
• Back pressure came from the "killer micros"
• Architecture design focused on "problems"
• The problem the architect thought was the most pressing varied by the background of the architect
• Examples were low latency communication, fast synchronization, coherent shared memory, scalability, ...

iWARP
• iWARP (HT Kung, CMU/Intel) -- fast communication and synchronization; ideal for Cannon's algorithm
• Culmination of several research projects: PSC and WARP
J-Machine
• J-Machine (Bill Dally, MIT/Intel) -- fast communication, large scalability, 3D mesh
• 3D design scales in physical space with minimum path lengths

KSR
• Kendall Square Research -- cache-only machine allowing data to move to where it is needed; ring communication structure
• Raised many interesting technical questions ...
DASH
• DASH (John Hennessy, Stanford/SGI) -- true distributed cache coherence
• Significant research project with impact in many areas ... will study ideas of the design

Recent Commercial Machines
• A variety of large machines have been deployed in recent years
• IBM SP-2, nonshared memory
• CM-5, nonshared memory
• Cray T3D and T3E, shared address space
• SGI Origin 2000, coherent shared memory
• Will the rich architectural diversity of the past continue, or will all parallel machines finally look alike?
Programming Approaches
• "Automatic Parallelization" of Fortran
• Preserves the investment in the "legacy codes"
• Replaces programmer effort with compiler effort
• Should be as successful as vectorization
• Has been demonstrated to achieve 0.9-10 fold speedup, with 2.5 being typical -- Amdahl's Law
• Alternative Languages
• Functional -- SISAL, Id Nouveau, etc.
• Logic -- Prolog
• These approaches failed ... the only successful programmers have programmed close to their machine; vectorized Cray Fortran continued to reign

Next Strategy to Save Legacy Codes ...
• As automatic parallelization foundered, adding "directives" or extending existing languages with new constructs came into fashion
• Annotating a program can take as much (more?) effort as rewriting
• HPF, HPC, HPC++, CC++, pC++, Split C, etc.
• Extending an existing language requires that its semantics be preserved; parallel extensions are at odds with sequential semantics
• The approach has failed ... programs are not easily transformed: parallelism => paradigm shift from sequential
The Prevailing Technique: Message Passing
• Message passing libraries provide standardized facilities for programmers to define a parallel implementation of their computation
• Uses existing languages, Fortran, C, etc. => save legacy
• Interface is standard across machines
• Lowest common denominator ... works on shared and distributed memory machines
• MP programming is difficult, subtle, error-prone; the programmer implements the paradigm shift
• Message passing embeds machine assumptions in code; not very portable
• As many as a dozen mp libs proposed; PVM and MPI are the only contenders (a minimal MPI example follows these slides)

State of Parallel Computing
• Many companies thought parallel computing was easy ... They're gone now ...
• SGI/Cray, IBM, HP, Sun make serious parallel computers
• Seattle's Tera Computer Inc struggles to introduce a new parallel architecture, MTA
• The basic reality of large computers has changed: servers drive the market, not speed-freaks
• The DoE's ASCI program pushes the envelope
• SMPs are ubiquitous
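A minimal sketch of what programming with the prevailing technique looks like, using MPI from C; the computation (summing one made-up value per process) is invented purely to show the shape of the code.

    #include <stdio.h>
    #include <mpi.h>

    /* Each process computes a local partial result, and the pieces are
     * combined explicitly; all data movement is the programmer's problem. */
    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double local = (double)rank + 1.0;     /* stand-in for real local work */
        double total = 0.0;

        /* The communication is stated explicitly by the programmer. */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d processes = %g\n", nprocs, total);

        MPI_Finalize();
        return 0;
    }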
Budget Parallelism
• "Rolling your own" parallel computer with workstations on a network (Beowulf) is popular
• This is simple and cost effective, and the machines can be used as workstations during business hours
• What are the impediments?
• Nonshared memory, nonshared address space
• Must be programmed w/ msg passing or ZPL
• As an incubator for new applications, Beowulfs do not promote the use of shared memory
• Everything in parallelism seems to come down to programming

Applications
• Traditionally, NASA, DoD, DoE labs and their contractors have been big iron users
• CFD
• Structural
• "Bomb Codes"
• The ability to run legacy code, dusty decks, can be significant
• A huge early success was Shell Oil's seismic modelling code developed by Don Heller -- parallelism that made money
• IBM did circuit simulation on a custom SIMD
• Many claims were made, but actual working parallelism was rare in the 80s
Government Programs
• Over the years, numerous efforts by funding agencies have tried to jump-start high performance parallel computing
• DARPA, NSF, ONR, AFOSR, DoE, NASA, ...
• Some have been criticized as corporate welfare
• Initial thrust was on hardware
• Companies have invested heavily, too
• The most significant federal effort was the High Performance Computing and Communications Initiative (HPCC) in the early '90s

HPCC
• Took on the "whole" problem by considering hw, sw and applications involving "real" users
• Compared to predecessors, it was well planned
• Attempt at interagency coordination
• "Real" users with science or engineering apps had to collaborate with "real" computer scientists
• Introduced the concept of a "grand challenge" problem, a computation which, if it received real performance, would cause better science to be done
Grand Challenge Problems
• Booklets with snappy graphics enumerate these
• Example classes:
• Galaxy simulation
• Molecular dynamics
• Climate modelling
• Protein folding
• CFD
• Circuit simulation
• Structural analysis
• Seismic modelling
• ... and many, many variations

HPCC Legacy
• Ironically, the HPCC initiative left many "real" scientists and engineers with the view that computation is a third pillar in research, along with theory and experimentation
• It also convinced most people of the obvious: SOFTWARE -- "It's the Economy, Stupid"
• As things stand, most users are writing message passing programs at considerable effort ...
• HPCC's main error -- it promised success in 5 years