Why Parallel
the greed for speed is a permanent malady
2 basic options:
❏ Build a faster uniprocessor
  • advantages
    • programs don't need to change
      • compilers may need to change to take advantage of intra-CPU parallelism
  • disadvantages
    • improved CPU performance is very costly - we already see diminishing returns
    • very large memories are slow
❏ Parallel Processors
  • today implemented as an ensemble of microprocessors
  • SAN style interconnect
  • large variation in how memory is treated

Parallel Processors
❏ The high end requires this approach
  • DOE's ASCI program for example
❏ Advantages
  • leverage off the sweet spot technology
  • huge, partially unexplored set of options
❏ Disadvantages
  • software - optimized balance and change are required
  • overheads - a whole new set of organizational disasters are now possible


Types of Parallelism
Note: many overlaps
• lookahead & pipelining
• vectorization
• concurrency & simultaneity
• data and control parallelism
• partitioning & specialization
• interleaving & overlapping of physical subsystems
• multiplicity & replication
• time & space sharing
• multitasking & multiprogramming
• multi-threading
• distributed computing - for speed or availability

Historical Perspective
Table 1: Generations of computer systems
Generation | Technology and Architecture | Software and Applications | Representative Systems
First (1945 - 1954) | Vacuum tubes and relay memories - simple PC and ACC | Machine language, single user, programmed I/O | ENIAC, Princeton IAS, IBM 701
Second (1955 - 1964) | Discrete transistors, core memory, floating point arith., I/O processors | Fortran & Cobol, subroutine libraries, batch processing OS | IBM 7090, CDC 1604, Univac LARC, Burroughs B5500
Third (1965 - 1974) | SSI and MSI ICs, microprogramming, pipelining, and lookahead | More HLLs, multiprogramming and timesharing OS, protection and file system capability | IBM 360/370, CDC 6600, TI ASC, PDP-8
Fourth (1975 - 1990) | LSI/VLSI processors, semiconductor memory, vector supercomputers, multicomputers | Multiprocessor OS, parallel languages, multiuser applications | VAX 9000, Cray X-MP, FPS T2000, IBM 3090
Fifth (1991 - present) | ULSI/VHSIC processors, memory, switches; high density packages and scalable architectures | MPP, grand challenge applications, distributed and heterogeneous processing, I/O becomes real | IBM SP, SGI Origin, Intel ASCI Red

What changes when you get more than 1?
everything is the easy answer!
2 areas deserve special attention
❏ Communication
  • 2 aspects are always of concern
    • latency & bandwidth
  • before - I/O meant disk/etc. = slow latency & OK bandwidth
  • now - interprocessor communication = fast latency and high bandwidth - becomes as important as the CPU
❏ Resource Allocation
  • smart Programmer - programmed
  • smart Compiler - static
  • smart OS - dynamic
  • hybrid - some of all of the above is the likely balance point

Inter-PE Communication
software perspective
❏ Implicit via memory
  • distinction of local vs. remote
  • implies some shared memory
  • sharing model and access model must be consistent
❏ Explicit via send and receive
  • need to know destination and what to send
  • blocking vs. non-blocking option
  • usually seen as message passing
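To make the two styles concrete, here is a minimal sketch using Python's standard multiprocessing module as a stand-in for two PEs (the producer/consumer names and the payload are illustrative, not from the slides): writing a shared Value is the implicit, memory-based style, while Queue.put/get is the explicit send/receive style, with get() blocking by default.

```python
from multiprocessing import Process, Queue, Value

def producer(q, shared):
    shared.value = 42            # implicit: communicate by writing shared memory
    q.put("payload from PE0")    # explicit: send needs a destination and the data

def consumer(q, shared):
    msg = q.get()                # blocking receive; q.get(block=False) is the
    print(msg, shared.value)     # non-blocking option (raises queue.Empty)

if __name__ == "__main__":
    q, shared = Queue(), Value("i", 0)   # one message channel, one shared word
    p = Process(target=producer, args=(q, shared))
    c = Process(target=consumer, args=(q, shared))
    p.start(); c.start(); p.join(); c.join()
```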


Inter-PE Communication
hardware perspective
❏ Senders and Receivers
  • memory to memory
  • CPU to CPU
    • scalability issues
  • CPU activated/notified but transaction is memory to memory
    • which memory - registers, caches, main memory
❏ Efficiency requires
  • consistent SW & HW models
  • policies should not conflict

Communication Performance
critical for MP performance
❏ 3 key factors
  • bandwidth
    • does the interconnect fabric support the needs of the whole collection
  • latency
    • = sender overhead + time of flight + transmission time + receiver overhead
    • transmission time = interconnect overhead
  • latency hiding
    • capability of the processor nodes
    • lots of idle processors is not a good idea

A detailed study of interconnects is the last chapter topic, since we need to understand I/O first.
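The latency breakdown on the Communication Performance slide above can be turned into a tiny model; the function and the example numbers below are illustrative assumptions, not measurements of any particular interconnect.

```python
def end_to_end_latency(sender_ovhd, time_of_flight, msg_bytes, bandwidth, receiver_ovhd):
    """Latency = sender overhead + time of flight + transmission time + receiver overhead."""
    transmission_time = msg_bytes / bandwidth        # occupancy of the interconnect
    return sender_ovhd + time_of_flight + transmission_time + receiver_ovhd

# e.g. a 1 KB message: 1 us send overhead, 0.5 us flight, 1 GB/s link, 1 us receive
print(end_to_end_latency(1e-6, 0.5e-6, 1024, 1e9, 1e-6))   # ~3.5 microseconds
```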

Flynn's Taxonomy - 1972
too simple but it's the only one that moderately works
4 categories = (Single, Multiple) X (Instruction Stream, Data Stream)
❏ SISD - conventional uniprocessor system
  • still lots of intra-CPU parallelism options
❏ SIMD - vector and array style computers
  • started with ILLIAC
  • first accepted multiple-PE style systems
  • now has fallen behind the MIMD option
❏ MISD - ~ systolic or stream machines
  • example: iWarp and MPEG encoder
❏ MIMD - intrinsic parallel computers
  • lots of options - today's winner - our focus

MIMD options
❏ Heterogeneous vs. Homogeneous PE's
❏ Communication Model
  • explicit: message passing
  • implicit: shared-memory
  • oddball: some shared, some non-shared memory partitions
❏ Interconnection Topology
  • which PE gets to talk directly to which PE
  • blocking vs. non-blocking
  • packet vs. circuit switched
  • wormhole vs. store and forward
  • combining vs. not
  • synchronous vs. asynchronous


The Easy and Cheap Obvious Option
❏ Microprocessors are cheap
❏ Memory chips are cheap
❏ Hook them up somehow to get n PE's
❏ Multiply each PE's performance by n and get an impressive number
What's wrong with this picture?
  • most uP's have been architected to be the only one in the system
  • most memories only have one port
  • interconnect is not just "somehow"
  • anybody who computes system performance with a single multiply is a moron

Ideal Performance - the Holy Grail
❏ Requires perfect match between HW & SW
❏ Tough given static HW and dynamic SW
  • hard means cast in concrete
  • soft means the programmer can write anything
❏ Hence performance depends on:
  • The hardware: ISA, memory, cycle time, etc.
  • The software: OS, task-switch, compiler, application code
❏ Simple performance model (aka uniprocessor)
  CPU-time (T) = Instruction-count (Ic) × CPI × Cycle-time (τ)
❏ But CPI can vary by more than 10x

CPI Stretch Factors
❏ Conventional uniprocessor factors
  • TLB miss penalty, page fault penalty, cache miss penalty
  • pipeline stall penalty, OS fraction penalty
❏ Additional multiprocessor factors
  • shared memory
    • non-local access penalty
    • consistency maintenance penalty
  • message passing
    • send penalty even for non-blocking
    • receive or notification penalty - task switch penalty (probably 2x)
    • body copy penalty
    • protection check penalty
    • etc. - the OS fraction typically goes up

The Idle Factor Paradox
❏ After the stretch factor, the performance equation becomes
  T = Ic × CPI × stretch × τ
❏ For an ideally scalable n-PE system, T/n will be the CPU time required
❏ But idle time will create its own penalty
❏ Hence
  T = ( Ic × CPI × stretch × τ × Σ_{i=1..n} 1/(1 − %idle_i) ) / n
❏ What if %idle goes up faster than n?
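As a sanity check on the idle-factor equation, here is a small sketch. It assumes Ic is the work each PE executes, and all of the numbers in the demo calls are made up for illustration.

```python
def parallel_cpu_time(ic, cpi, cycle_time, stretch, idle_fractions):
    """T = (Ic x CPI x stretch x tau x sum over PEs of 1/(1 - %idle_i)) / n."""
    n = len(idle_fractions)
    base = ic * cpi * stretch * cycle_time
    return sum(base / (1.0 - idle) for idle in idle_fractions) / n

# 10^9 instructions per PE, CPI 1.5, 1 ns cycle, 2x stretch, 4 PEs
print(parallel_cpu_time(1e9, 1.5, 1e-9, 2.0, [0.0, 0.0, 0.0, 0.0]))   # no idle time: the ideal case
print(parallel_cpu_time(1e9, 1.5, 1e-9, 2.0, [0.1, 0.2, 0.3, 0.4]))   # growing idle time inflates T
```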


Shared Memory UMA
Uniform Memory Access
❏ Sequent Symmetry S-81
  • symmetric ==> all PE's have same access to I/O, memory, executive (OS) capability, etc.
  • asymmetric ==> capability at PE's differs
[Figure: P0, P1, ..., Pn, each with a cache ($), connected by an interconnect (bus, crossbar, multistage, ...) to I/O modules I/O0..I/Oj and shared memory modules SM0, SM1, ..., SMk]

Modern NUMA View
❏ All uP's set up for SMP
  • SMP ::= symmetric multiprocessor
  • communication is usually the front side bus
  • example
    • Pentium III and 4 Xeon's set up to support 2-way SMP
    • just tie the FSB wires
  • as clock speeds have gone up for n-way SMP's
    • FSB capacitance has reduced the value of n
❏ Chip based SMP's
  • IBM's Power 4
    • 2 cores on the same die
    • set up to support 4 cores

NUMA Shared Memory opus 1
Non-Uniform Memory Access - 1 level
❏ e.g. BBN Butterfly + others
[Figure: processors P0..Pn, each with its own local memory LM0..LMn, all attached to a Global Interconnect]

NUMA Shared Memory opus 2
2 level
❏ e.g. Univ. of Ill. Cedar + CMU Cm* & C.mmp
[Figure: global shared memory modules (GSM) on a Global Interconnect; clusters of processors (P) with cluster shared memories (CSM), each cluster joined by its own cluster interconnection network (CIN)]

NOTES:
• transfer initiated by: LMx or Px
• answer to: LMx or Px
• all options have been seen in practice
• today - nodes can be SMP's or CMP's, e.g. SUN, Compaq, IBM
• the easy and cheap option - just add interconnect


COMA Shared Memory
Cache Only Memory Access
❏ e.g. KSR-1
[Figure: processors (P), each with a cache (C) and directory (D), connected by an interconnect]

Lots of other DSM variants
❏ Cache consistency
  • DEC Firefly - up to 16 snooping caches in a workstation
❏ Directory based consistency
  • like the COMA model but deeper memory hierarchy
  • e.g. Stanford DASH machine, MIT Alewife, Alliant FX-8
❏ Delayed consistency
  • many models for the delayed updates
  • a software protocol more than a hardware model
  • e.g. MUNIN - John Carter (good old U of U)
  • other models - Alan Karp and the IBM crew


NORMA
No remote memory access = message passing
[Figure: nodes, each a processor (P) with its own local memory (M), connected only through a Message Passing Interconnect]
❏ Remember the simple and cheap option?
  • with the exception of the interconnect
  • this is the simple and cheap option

Message Passing MIMD Machines
❏ Many players:
  • Schlumberger FAIM-1
  • HPL Mayfly
  • CalTech Cosmic Cube and Mosaic
  • NCUBE
  • Intel iPSC
  • Parsys SuperNode1000
  • Intel Paragon
  • (binary n-cubes, meshes, tori, and you name it)


Message Passing vs. Shared Memory
• shared memory advantages
  • programming model is simple and familiar
  • quick port of existing code - then try to parallelize, but at least something is running that you can profile
  • low communication overhead for small items
    • OS isn't in the way of a memory reference
  • caching helps alleviate communication needs
• message passing advantages
  • simple hardware ==> faster
  • communication is explicit - good and bad news for the programmer
  • natural synchronization - associated with messages
• duality
  • either model can be built on the other
  • easier to map message passing onto shared memory than vice versa
  • funny: message passing on the Origin 2K was faster than on the IBM SP2 (note this hopefully has changed)

Parallel Performance Challenge
❏ Amdahl's law in action
  • enhanced = parallel in this case
  • example 1 - code centric
    • 80% of your code is parallel
    • ==> best you can do is get a speedup of 5, no matter how many processors you throw at the problem
  • example 2 - speedup centric
    • want 80x speedup on 100 processors
    • ==> fraction_parallel = 0.9975
    • this will be hard
❏ Linear speedup is hard
❏ Superlinear speedup is easier
  • lots more memory may remove the need to page
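The two Amdahl's-law examples can be checked with a couple of one-liners; this is only a sketch and the function names are illustrative.

```python
def speedup(parallel_fraction, p):
    """Amdahl's law: the serial fraction caps speedup regardless of p."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / p)

def fraction_needed(target_speedup, p):
    """Parallel fraction required to reach a target speedup on p processors."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / p)

print(speedup(0.8, 10_000))       # example 1: ~5x ceiling when 80% of the code is parallel
print(fraction_needed(80, 100))   # example 2: ~0.9975 parallel fraction for 80x on 100 PEs
```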

Modern Remote Memory Access Times
critical limit for shared memory performance
Table 1: Remote memory access times
Multiprocessor | Year Shipped | Type | Max PE count | Interconnect | Remote Memory Access (ns)
Sun Starfire Servers | 1996 | SMP | 64 | multiple buses | 500
SGI Origin 3000 | 1999 | NUMA | 512 | fat hypercube | 500
Cray T3E | 1996 | NUMA | 2048 | 2-way 3D torus | 300
HP series | 1998 | SMP | 32 | 8x8 crossbar | 1000
Compaq AlphaServer GS | 1999 | SMP | 32 | switched buses | 400

Parallel Workloads
❏ Even more disparate
  • application characteristics
  • performance varies with
    • uniprocessor and communication utilization
  • wide variance with architecture type
❏ 3 workloads studied
  • commercial
    • OLTP based on TPC-B
    • DSS based on TPC-D
    • Web index search based on AltaVista and a 200GB database
  • multiprogrammed & OS
    • 2 independent copies of compiling the Andrew file system
    • phases: compile (compute bound), install object files, remove files (I/O bound)
  • Scientific/Technical
    • FFT, LU, Ocean, and Barnes


Workload Effort Characteristics
Commercial Workload (4 processor AlphaServer 4100)
Table 1:
Benchmark | % time in user mode | % time in kernel mode | % time CPU idle
OLTP | 71 | 18 | 11
DSS range for all 6 queries | 82-94 | 3-5 | 4-13
DSS average | 87 | 3.7 | 9.3
AltaVista | >98 | <1 | <1

Multiprogrammed & OS (8 processors - simulated)
Table 2:
  | User | Kernel | Synch Wait | CPU idle (I/O wait)
% instructions xeq'd | 27 | 3 | 1 | 69
% xeq time | 27 | 7 | 2 | 64

Scientific/Technical
❏ FFT
  • 1D version for a complex number FFT
  • 3 data structures - in and out arrays plus a precomputed read-only roots matrix
  • steps
    • transpose the data matrix
    • 1D FFT on each row of data
    • multiply roots matrix by the data matrix
    • transpose data matrix
    • 1D FFT on each row of data matrix
    • transpose data matrix
  • communication
    • all-to-all communication in the three transpose phases
    • each processor transposes one block locally and sends one block to each other processor
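The transpose/FFT/multiply sequence above is the classic transpose-based ("six-step") formulation; the following single-node numpy sketch mirrors those steps. In the parallel benchmark each processor owns a block of rows and the transposes become the all-to-all phases; the function name and the 8x8 test size here are illustrative.

```python
import numpy as np

def six_step_fft(x, n1, n2):
    """1D FFT of length n1*n2 via the transpose-based steps on the slide."""
    w = np.exp(-2j * np.pi / (n1 * n2))
    roots = w ** np.outer(np.arange(n1), np.arange(n2))   # precomputed read-only roots matrix

    a = x.reshape(n2, n1)          # view the 1D signal as an n2 x n1 data matrix
    b = np.fft.fft(a.T, axis=1)    # transpose, then 1D FFT on each row
    c = roots * b                  # multiply the roots matrix by the data matrix
    d = np.fft.fft(c.T, axis=1)    # transpose, then 1D FFT on each row
    return d.T.reshape(-1)         # final transpose, flatten back to 1D

x = np.random.rand(64) + 1j * np.random.rand(64)
print(np.allclose(six_step_fft(x, 8, 8), np.fft.fft(x)))   # expect True
```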

The Other Kernel
❏ LU
  • typical dense matrix factorization
  • used in a variety of solvers and eigenvalue computations
  • turns a matrix into an upper triangular factor
  • blocking helps code to be cache friendly
  • block size
    • small enough to keep cache miss rate low
    • large enough to maximize the parallel phase

Ocean
❏ Goal
  • global weather modeling
  • note that 75% of the earth's surface is ocean
  • ocean currents and atmosphere have a major weather impact
  • near vertical walls there is a significant eddy effect
❏ Physical problem
  • continuous in both 1D time and 3D space
❏ Discrete model for simulation
  • model the ocean as a discrete set of equally spaced points
  • point variables for pressure, current direction and speed, temperature, etc.
  • simplify here to a set of 2D point planes
    • admittedly less accurate and changes convergence aspects
    • eases the use of this application and still points out key issues
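Returning to the LU kernel above: a minimal unblocked sketch of the factorization it describes is below (no pivoting, purely illustrative). The production kernel tiles these same two updates into b x b blocks for cache reuse.

```python
import numpy as np

def lu_inplace(a):
    """Right-looking LU: upper triangle becomes U, strict lower triangle holds L's multipliers."""
    n = a.shape[0]
    for k in range(n):
        a[k+1:, k] /= a[k, k]                                 # column of multipliers
        a[k+1:, k+1:] -= np.outer(a[k+1:, k], a[k, k+1:])     # trailing submatrix update
    return a

m = np.random.rand(4, 4) + 4 * np.eye(4)        # diagonally dominant, so safe without pivoting
lu = lu_inplace(m.copy())
l, u = np.tril(lu, -1) + np.eye(4), np.triu(lu)
print(np.allclose(l @ u, m))                    # expect True
```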


Ocean’s Ocean Model Ocean the benchmark ❏ Data • 2D arrays for each variable • all arrays model each cross section plane ❏ Time • solving system of motion equations • sweep through all of the points for some point in time • continue to next time step Rectangular basin = 3D ❏ simplify = 2d plane set Granularity separate 2d array for each variable • big influence on computation time equal spaced points • 2Mm x 2Mm = atlantic ocean continuous ==> discrete • 5 years of 1 minute time steps & 1 Km spacing = 2.628Msteps for 4 x 106 pts is intractable • must go to larger grain for now - however solution style is the key here

Ocean Decomposition
❏ Model the weighted nearest neighbor average
  • A[i,j] = 0.2 x (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])
• Evolve the sequential algorithm
  • bogus once again - little parallelism
• Note the anti-diagonal option (orthogonal to the resultant dependence vector)
  • Control and Load Imbalance Issues??
• Red Black Decomposition
  • Dependencies?
  • Parallelism?
  • Convergence properties?

equation kernel solver
❏ Solves a differential equation
  • via a finite difference method
  • operates on an (n+2) x (n+2) matrix
    • +2 ==> border rows and columns which do not change while the interior is being computed
    • then borders are changed and communicated, and then we go to the next step
  • uses Gauss-Seidel update
    • computes a weighted average for each point based on 4 neighbors
    • order may vary but let's start with row-major order
    • implies new values from above and left but old values from below and right neighbors
  • step termination
    • if the sum of the differences for all points is less than some tolerance then done, otherwise make another sweep over the array
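A minimal sequential sketch of the solver kernel just described (the function name and the tolerance value are illustrative): the borders are left untouched and sweeps repeat until the summed per-point change falls below the tolerance.

```python
import numpy as np

def solve(a, tol=1e-3):
    """Gauss-Seidel sweeps over the interior of an (n+2) x (n+2) grid."""
    n = a.shape[0] - 2
    while True:
        diff = 0.0
        for i in range(1, n + 1):        # row-major order: new values from above/left,
            for j in range(1, n + 1):    # old values from below/right
                old = a[i, j]
                a[i, j] = 0.2 * (a[i, j] + a[i, j-1] + a[i-1, j] + a[i, j+1] + a[i+1, j])
                diff += abs(a[i, j] - old)
        if diff < tol:                   # step termination: total change below tolerance
            return a

print(solve(np.random.rand(10, 10))[1:-1, 1:-1].round(2))   # small demo grid
```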


Ocean Communication
side effect of blocked grid based solver
perimeter vs. area
[Figure: n x n grid partitioned into a 4 x 4 array of blocks assigned to P0 through P15; each processor owns an (n/√p) x (n/√p) block]
  Local Work ∝ n²/p
  Remote Communication ∝ 4n/√p

Blocking in Ocean
influences cache locality
❏ Kernel
  • mindless 2D version
  • 2D inside 2D = 4D arrays
  • consider cache effects
    • spatial and temporal locality
  • other effects
    • blocks can also be influenced by processor partition
    • particularly useful if address space is shared as in a DSM machine
    • boundary problems?
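The perimeter-vs-area point from the Ocean Communication slide can be quantified directly; this sketch just evaluates the two proportionalities above (constants omitted, so treat the output as a ratio, not a time).

```python
import math

def compute_to_communicate(n, p):
    local_work = n * n / p                   # area: interior points each PE updates
    remote_comm = 4 * n / math.sqrt(p)       # perimeter: border points exchanged
    return local_work / remote_comm          # = n / (4 * sqrt(p))

print(compute_to_communicate(1024, 16))      # grows with n, shrinks only as sqrt(p)
```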

Boundary Issues
❏ Inherent problem
  • assume row major order
  • column lines will have poor spatial locality

Ocean
❏ Sequential algorithm
  • outer loop over a very large number of time steps
  • time-step = 33 computations
    • each one using a small number of grid variables
  • typical computation
    • sum of scalar multiples from close-by grid points
    • nearest neighbor averaging sweep
  • add multigrid technique
    • levels of coarseness: +1 ==> ignore every other grid point
    • start at the finest level and look at the diff
    • if small but above tolerance then bump up +1 coarser
    • if large then -1 coarseness (in between, stay at this level)
    • accelerates convergence
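A sketch of the coarseness control described above; all names and thresholds are hypothetical placeholders, and the real benchmark folds this decision into its solver loop.

```python
def next_level(level, diff, tol, small, large, coarsest):
    """Pick the next multigrid level from the current residual 'diff'."""
    if diff < tol:
        return None                          # converged - stop sweeping
    if diff < small:
        return min(level + 1, coarsest)      # small but unconverged: go one level coarser
    if diff > large:
        return max(level - 1, 0)             # large residual: go one level finer
    return level                             # in between: stay at this level
```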


Barnes-Hut
❏ Simulates evolution of galaxies
  • classic N-body problem
❏ Characteristics
  • no spatial regularity so computation becomes particle based
  • every particle exerts influence on every other particle
    • ugh - hence O(n²)
  • but clustering of distant star groups can be based on center of mass since
    Gravitational Force = G × M1 × M2 / r²
  • result is O(n log n)
  • close stars must be represented individually

Octree Hierarchy
❏ 3D galaxy represented as an octree
  • divide galaxy recursively into 8 equally sized children
    • based on equal space volumes independent of membership
  • if a subspace has more than x bodies then subdivide again
    • x typically will be something like 8
  • tree is traversed once per body
    • determines force on that body
  • bodies move so tree is rebuilt every time step
❏ Group optimization
  • if the cell is far enough away
    • l/d < x, where l = length of a side of the cell, d = distance of the body from the cell's center of mass, and x is the accuracy parameter (typically between 0.5 and 1.2)
    • then treat the cell as a single body
  • otherwise you have to open the cell and proceed
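The opening criterion and the center-of-mass shortcut can be sketched as a recursive walk. Everything below (the Cell record, the 2D geometry instead of a 3D octree, theta = 0.7) is an illustrative assumption rather than the benchmark's actual data structure.

```python
import math

class Cell:
    """A tree node: center of mass, total mass, side length, child cells (empty = leaf body)."""
    def __init__(self, com, mass, side, children=()):
        self.com, self.mass, self.side, self.children = com, mass, side, children

def force_on(pos, mass, cell, theta=0.7, G=6.674e-11):
    dx, dy = cell.com[0] - pos[0], cell.com[1] - pos[1]
    d = math.hypot(dx, dy) or 1e-12                     # avoid divide-by-zero for the body itself
    if not cell.children or cell.side / d < theta:      # l/d < accuracy parameter: far enough away
        f = G * mass * cell.mass / d**2                 # treat the whole cell as a single body
        return f * dx / d, f * dy / d
    fx = fy = 0.0
    for child in cell.children:                         # otherwise open the cell and recurse
        cfx, cfy = force_on(pos, mass, child, theta, G)
        fx, fy = fx + cfx, fy + cfy
    return fx, fy
```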

Tree Example
2D = quadtree
[Figure: a 2D spatial decomposition and its quadtree equivalent - each non-leaf node holds the center of mass for its group; each leaf holds a body's mass, velocity, etc.]

Barnes-Hut
❏ Sequential Algorithm
  • 100's of time steps
  • each step computes the net force on every body
  • updates body position and other attributes (velocity, acceleration, direction)
  • flow per time step: build tree --> compute cell moments --> traverse tree to compute forces (the dominant phase) --> update body properties

S/T Workload Scaling

Table 1: Scaling of computation, of communication, and of the compute-to-communicate ratio
Application | Computation Scaling | Communication Scaling | Compute/Communicate Scaling per processor
FFT | (n log n)/p | n/p | log n
LU | n/p | √n/√p | √n/√p
Barnes | (n log n)/p | approximately (√n log n)/√p | approximately √n/√p
Ocean | n/p | √n/√p | √n/√p
