Why Parallel / Parallel Processors / Types of Parallelism / Historical Perspective
Why Parallel

the greed for speed is a permanent malady

2 basic options:
❏ Build a faster uniprocessor
  • advantages
    • leverage off the sweet-spot technology
    • programs don't need to change
    • compilers may need to change to take advantage of intra-CPU parallelism
  • disadvantages
    • improved CPU performance is very costly - we already see diminishing returns
    • very large memories are slow
❏ Parallel processors
  • today implemented as an ensemble of microprocessors
  • SAN-style interconnect
  • large variation in how memory is treated

Parallel Processors

❏ The high end requires this approach
  • DOE's ASCI program, for example
❏ Advantages
  • huge, partially unexplored set of options
❏ Disadvantages
  • software - optimized balance and change are required
  • overheads - a whole new set of organizational disasters are now possible

Types of Parallelism

Note: many overlaps
  • lookahead & pipelining
  • vectorization
  • concurrency & simultaneity
  • data and control parallelism
  • partitioning & specialization
  • interleaving & overlapping of physical subsystems
  • multiplicity & replication
  • time & space sharing
  • multitasking & multiprogramming
  • multi-threading
  • distributed computing - for speed or availability

Historical Perspective

Table 1: Generations of computer systems
  First (1945 - 1954)
    Technology and architecture: vacuum tubes and relay memories; simple PC and ACC
    Software and applications: machine language; single user; programmed I/O
    Representative systems: ENIAC, Princeton IAS, IBM 701
  Second (1955 - 1964)
    Technology and architecture: discrete transistors; core memory; floating-point arithmetic; I/O processors
    Software and applications: Fortran & Cobol; subroutine libraries; batch processing OS
    Representative systems: IBM 7090, CDC 1604, Univac LARC, Burroughs B5500
  Third (1965 - 1974)
    Technology and architecture: SSI and MSI ICs; microprogramming; pipelining, cache, and lookahead
    Software and applications: more HLLs; multiprogramming and timesharing OS; protection and file system capability
    Representative systems: IBM 360/370, CDC 6600, TI ASC, PDP-8
  Fourth (1975 - 1990)
    Technology and architecture: LSI/VLSI processors; semiconductor memory; vector supercomputers; multicomputers
    Software and applications: multiprocessor OS; parallel languages; multiuser applications
    Representative systems: VAX 9000, Cray X-MP, FPS T2000, IBM 3090
  Fifth (1991 - present)
    Technology and architecture: ULSI/VHSIC processors, memory, and switches; high-density packages; scalable architectures
    Software and applications: MPP and grand challenge applications; distributed and heterogeneous processing; I/O becomes real
    Representative systems: IBM SP, SGI Origin, Intel ASCI Red

What changes when you get more than 1?

everything is the easy answer!
2 areas deserve special attention
❏ Communication
  • 2 aspects are always of concern: latency & bandwidth
  • before - I/O meant disk etc. = slow latency & OK bandwidth
  • now - interprocessor communication = fast latency and high bandwidth - becomes as important as the CPU
❏ Resource allocation
  • smart programmer - programmed
  • smart compiler - static
  • smart OS - dynamic
  • hybrid - some of all of the above is the likely balance point

Inter-PE Communication: software perspective

❏ Implicit, via memory
  • distinction of local vs. remote
  • implies some shared memory
  • sharing model and access model must be consistent
❏ Explicit, via send and receive
  • need to know the destination and what to send
  • blocking vs. non-blocking option
  • usually seen as message passing
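To make the explicit send/receive style above concrete, here is a minimal sketch, not taken from the course materials, using MPI as one common message-passing interface; the message contents, tags, and two-process setup are arbitrary choices for illustration.

/* Minimal sketch (not from the slides): explicit send/receive communication,
 * contrasting a blocking and a non-blocking transfer between two PEs.
 * Build with an MPI toolchain, e.g.:  mpicc explicit_comm.c -o explicit_comm
 * Run with two processes:             mpirun -np 2 ./explicit_comm
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, buf = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        buf = 42;
        /* Blocking send: returns only when buf may be safely reused. */
        MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

        /* Non-blocking send: returns immediately; overlap work, then wait. */
        MPI_Request req;
        MPI_Isend(&buf, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &req);
        /* ... useful computation could overlap the transfer here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        /* Receiver must know the source and what arrives (count, type, tag). */
        MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("blocking transfer delivered %d\n", buf);
        MPI_Recv(&buf, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("non-blocking transfer delivered %d\n", buf);
    }

    MPI_Finalize();
    return 0;
}

Under the implicit model, PE 1 would instead simply load an address that PE 0 had stored to, with the sharing and access model determining when the new value becomes visible.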
Inter-PE Communication: hardware perspective

❏ Senders and receivers
  • memory to memory
  • CPU to CPU
  • CPU activated/notified, but the transaction is memory to memory
  • which memory - registers, caches, main memory
❏ Efficiency requires
  • consistent SW & HW models
  • policies should not conflict

Communication Performance

critical for MP performance
❏ 3 key factors
  • bandwidth
    • does the interconnect fabric support the needs of the whole collection?
    • scalability issues
  • latency
    • latency = sender overhead + time of flight + transmission time + receiver overhead
    • transmission time = interconnect overhead
  • latency hiding capability of the processor nodes
    • lots of idle processors is not a good idea
detailed study of interconnects is the last chapter's topic, since we need to understand I/O first
(a small worked latency sketch follows The Easy and Cheap Obvious Option slide below)

Flynn's Taxonomy - 1972

too simple, but it's the only one that moderately works
4 categories = (Single, Multiple) X (Data Stream, Instruction Stream)
❏ SISD - conventional uniprocessor system
  • still lots of intra-CPU parallelism options
❏ SIMD - vector and array style computers
  • started with ILLIAC
  • first accepted multiple-PE style of systems
  • now has fallen behind the MIMD option
❏ MISD - ~ systolic or stream machines
  • example: iWarp and MPEG encoder
❏ MIMD - intrinsic parallel computers
  • lots of options - today's winner - our focus

MIMD Options

❏ Heterogeneous vs. homogeneous PEs
❏ Communication model
  • explicit: message passing
  • implicit: shared memory
  • oddball: some shared, some non-shared memory partitions
❏ Interconnection topology
  • which PE gets to talk directly to which PE
  • blocking vs. non-blocking
  • packet vs. circuit switched
  • wormhole vs. store and forward
  • combining vs. not
  • synchronous vs. asynchronous

The Easy and Cheap Obvious Option

❏ Microprocessors are cheap
❏ Memory chips are cheap
❏ Hook them up somehow to get n PEs
❏ Multiply each PE's performance by n and get an impressive number

What's wrong with this picture?
  • most uPs have been architected to be the only one in the system
  • most memories only have one port
  • interconnect is not "just somehow"
  • anybody who computes system performance with a single multiply is a moron
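As promised under the Communication Performance slide, this is a minimal sketch of the latency decomposition given there (sender overhead + time of flight + transmission time + receiver overhead). It is not from the course materials: the function name and all constants are invented placeholders, and transmission time is modeled here as message size divided by bandwidth, one common reading of what the slide calls the interconnect overhead.

/* Minimal sketch (not from the slides): one-way message latency as the sum of
 * the four components named on the Communication Performance slide.
 * All numbers are invented placeholders, not measurements. */
#include <stdio.h>

static double message_latency_us(double sender_overhead_us,
                                 double time_of_flight_us,
                                 double message_bytes,
                                 double bandwidth_bytes_per_us,
                                 double receiver_overhead_us) {
    /* Transmission time modeled as size / bandwidth (assumption). */
    double transmission_time_us = message_bytes / bandwidth_bytes_per_us;
    return sender_overhead_us + time_of_flight_us +
           transmission_time_us + receiver_overhead_us;
}

int main(void) {
    /* Hypothetical node: 1 us send/receive overheads, 0.5 us flight time,
     * a link moving 1000 bytes per us (~1 GB/s), and a 4 KB message. */
    double t = message_latency_us(1.0, 0.5, 4096.0, 1000.0, 1.0);
    printf("estimated one-way latency: %.2f us\n", t);
    return 0;
}

Even this toy breakdown shows why both latency and bandwidth matter: for small messages the fixed overheads dominate, while for large ones the size/bandwidth term does.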
Ideal Performance - the Holy Grail

❏ Requires a perfect match between HW & SW
❏ Tough given static HW and dynamic SW
  • hard means cast in concrete
  • soft means the programmer can write anything
❏ Hence performance depends on:
  • the hardware: ISA, memory, cycle time, etc.
  • the software: OS, task switch, compiler, application code
❏ Simple performance model (aka uniprocessor)
  • CPU time (T) = Instruction count (Ic) × CPI × Cycle time (τ)
❏ But CPI can vary by more than 10x

CPI Stretch Factors

❏ Conventional uniprocessor factors
  • TLB miss penalty, page fault penalty, cache miss penalty
  • pipeline stall penalty, OS fraction penalty
❏ Additional multiprocessor factors
  • shared memory
    • non-local access penalty
    • consistency maintenance penalty
  • message passing
    • send penalty, even for non-blocking
    • receive or notification penalty - task switch penalty (probably 2x)
    • body copy penalty
    • protection check penalty
    • etc. - the OS fraction typically goes up

The Idle Factor Paradox

❏ After the stretch factor, the performance equation becomes
  • T = Ic × CPI × stretch × τ
❏ For an ideally scalable n-PE system, T/n will be the CPU time required
❏ But idle time will create its own penalty
❏ Hence
  • T = ( Σ_{i=1..n} [ Ic × CPI × stretch × τ / (1 − %idle_i) ] ) / n
❏ What if %idle goes up faster than n?
(a small numeric sketch of these two equations follows the COMA slide below)

Shared Memory UMA: Uniform Memory Access

❏ Sequent Symmetry S-81
  • symmetric ==> all PEs have the same access to I/O, memory, executive (OS) capability, etc.
  • asymmetric ==> capability at the PEs differs
[Figure: processors P0, P1, ..., Pn, each with a cache ($), share an interconnect (bus, crossbar, multistage, ...) that also connects I/O modules I/O0..I/Oj and shared memory modules SM0..SMk]

Modern NUMA View

❏ All uPs set up for SMP
  • SMP ::= symmetric multiprocessor
  • communication is usually the front-side bus
  • example
    • Pentium III and Pentium 4 Xeons are set up to support 2-way SMP
    • just tie the FSB wires
  • as clock speeds have gone up for n-way SMPs
    • FSB capacitance has reduced the value of n
❏ Chip-based SMPs
  • IBM's Power 4
    • 2 Power 3 cores on the same die
    • set up to support 4 cores

NUMA Shared Memory, opus 1 (one level): Non-Uniform Memory Access

❏ BBN Butterfly + others
[Figure: nodes (LM0, P0) through (LMn, Pn) attached to an interconnect; a transfer may be initiated by LMx or Px and answered by LMx or Px - all options have been seen in practice]
the easy and cheap option - just add interconnect

NUMA Shared Memory, opus 2 (two levels)

❏ e.g. Univ. of Ill. Cedar + CMU CM* & C.mmp
[Figure: clusters of processors (P) and cluster shared memories (CSM) joined within each cluster by a CIN, with the clusters and global shared memories (GSM) joined by a global interconnect]
Today - nodes can be SMPs or CMPs, e.g. Sun, Compaq, IBM

COMA Shared Memory: Cache Only Memory Access

❏ e.g. KSR-1
[Figure: processor (P) / cache (C) / directory (D) nodes attached to an interconnect]
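As promised on the Idle Factor Paradox slide, here is a minimal numeric sketch of the two performance equations above. It is not from the course materials: it treats Ic as the per-PE instruction count (total divided by n), an assumption the slide leaves implicit, and every input value is an invented placeholder rather than a measurement.

/* Minimal sketch (not from the slides): the stretched performance model
 *   T = Ic x CPI x stretch x tau
 * and the idle-factor version
 *   T = ( sum_{i=1..n} Ic_i x CPI x stretch x tau / (1 - %idle_i) ) / n
 * where Ic_i is assumed to be the per-PE instruction count (total / n). */
#include <stdio.h>

static double stretched_time(double ic, double cpi, double stretch, double tau) {
    return ic * cpi * stretch * tau;
}

/* Parallel time on n PEs, each running ic_total/n instructions and sitting
 * idle a fraction idle[i] of the time. */
static double idle_factor_time(double ic_total, double cpi, double stretch,
                               double tau, const double *idle, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += stretched_time(ic_total / n, cpi, stretch, tau) / (1.0 - idle[i]);
    return sum / n;
}

int main(void) {
    /* Hypothetical workload: 1e9 instructions, CPI 1.5, 1.4x stretch, 1 ns cycle. */
    double ic = 1e9, cpi = 1.5, stretch = 1.4, tau = 1e-9;

    double idle4[4], idle16[16];
    for (int i = 0; i < 4;  i++) idle4[i]  = 0.10;   /* 10% idle on 4 PEs          */
    for (int i = 0; i < 16; i++) idle16[i] = 0.80;   /* idle grew much faster than n */

    printf("uniprocessor T : %.3f s\n", stretched_time(ic, cpi, stretch, tau));
    printf("4 PEs          : %.3f s\n", idle_factor_time(ic, cpi, stretch, tau, idle4, 4));
    printf("16 PEs         : %.3f s\n", idle_factor_time(ic, cpi, stretch, tau, idle16, 16));
    return 0;
}

With these made-up numbers the 16-PE run comes out slower than the 4-PE run, which is exactly the paradox the slide's closing question points at: adding PEs does not help if %idle grows faster than n.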
Lots of other DSM variants

❏ Cache consistency
  • DEC Firefly - up to 16 snooping caches in a workstation
❏ Directory-based consistency
  • like the COMA model but with a deeper memory hierarchy
  • e.g. Stanford DASH machine, MIT Alewife, Alliant FX-8
  • (a minimal directory-entry sketch follows the NORMA slide below)
❏ Delayed consistency
  • many models for the delayed updates
  • a software protocol more than a hardware model
  • e.g. MUNIN - John Carter (good old U of U)
  • other models - Alan Karp and the IBM crew

NORMA Message Passing MIMD Machines

No remote memory access = message passing
❏ Many players:
  • Schlumberger FAIM-1
  • HPL Mayfly
  • CalTech Cosmic Cube and Mosaic
  • NCUBE
[Figure: memory (M) + processor (P) nodes attached to a message passing interconnect]
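As noted under Directory-based consistency above, a directory keeps a per-block record of which caches hold a copy so that a write can invalidate exactly those sharers. The sketch below is illustrative only and is not drawn from DASH, Alewife, or any other machine named on these slides; the state names, the 32-PE sharer bit vector, and the field names are assumptions.

/* Minimal sketch (illustrative only): a per-block entry for a directory-based
 * consistency scheme.  State names, sharer-vector width, and field names are
 * assumptions, not taken from any machine listed above. */
#include <stdint.h>
#include <stdio.h>

typedef enum {
    DIR_UNCACHED,   /* no cache holds the block               */
    DIR_SHARED,     /* one or more caches hold a clean copy   */
    DIR_EXCLUSIVE   /* exactly one cache holds a dirty copy   */
} dir_state_t;

typedef struct {
    dir_state_t state;
    uint32_t    sharers;   /* bit i set => PE i's cache has a copy (up to 32 PEs) */
    int         owner;     /* meaningful only in DIR_EXCLUSIVE                     */
} dir_entry_t;

/* On a write by PE `writer`, the directory identifies which caches to invalidate. */
static uint32_t invalidation_targets(const dir_entry_t *e, int writer) {
    return e->sharers & ~(1u << writer);
}

int main(void) {
    dir_entry_t e = { DIR_SHARED, (1u << 0) | (1u << 2) | (1u << 5), -1 };
    uint32_t victims = invalidation_targets(&e, 2);   /* PE 2 writes the block */
    printf("invalidate caches with mask 0x%08x\n", victims);
    return 0;
}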