Parallel Processing and Multiprocessors
• why parallel processing?
• types of parallel processors
• synchronization
• memory ordering

Why Parallel Processing
go past physical limits of uniprocessing (speed of light)
pros: performance • power • cost-effectiveness (commodity parts) • fault tolerance
cons: difficult to parallelize applications • automatic parallelization by compiler hard in general cases • parallel program development • IT IS THE SOFTWARE, stupid!


Amdahl's Law
speedup = 1 / (frac_enhanced / speedup_enhanced + (1 - frac_enhanced))
speedup of 80 with 100 processors => frac_parallel = 0.9975
• only 0.25% of the work can be serial
may help: problems where parallel parts scale faster than serial parts
• O(n^2) parallel vs. O(n) serial
challenge: long latencies (often several microsecs)
• achieve data locality in some fashion

Application Domains
Parallel Processing - true parallelism in one job
• data may be tightly shared
• typically hand-crafted and fine-tuned
OS - large parallel program that runs a lot of the time
• data more loosely shared
• typically locked data structures at differing granularities
transaction processing - parallel among independent transactions
• throughput-oriented parallelism
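A quick worked version of the 100-processor number above (my arithmetic, not from the slides), with N the processor count and f the parallel fraction:

\[
  \text{speedup} = \frac{1}{\frac{f}{N} + (1 - f)}, \qquad
  80 = \frac{1}{\frac{f}{100} + (1 - f)}
  \;\Rightarrow\; 1 - 0.99\,f = \tfrac{1}{80} = 0.0125
  \;\Rightarrow\; f = \tfrac{0.9875}{0.99} \approx 0.9975
\]

so only about 0.25% of the work may remain serial.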

Types: Flynn Taxonomy
• 1966
• not all encompassing but simple
• based on # of instruction streams and data streams
• SISD - uniprocessor
• SIMD - like vector
• MISD - few practical examples
• MIMD - multiprocessors - most common, very flexible

Single Instruction Single Data (SISD)
[figure: instruction storage and operand storage feeding a single instruction unit and execution unit - your basic uniprocessor]


Single Instruction Multiple Data (SIMD)
[figure: a control processor with its own instruction memory broadcasts each instruction to an array of ALUs, each with its own registers, flag, and data memory, connected by an interconnect/alignment network]

Single Instruction Multiple Data (SIMD)
vectors are the same idea as SIMD
• deeply pipelined FUs vs. multiple FUs in the previous slide
instructions and data usually separated
leads to the data parallel programming model
works best for very regular, loop-oriented problems
• many important classes - eg graphics
• not for commercial databases, middleware (80% of server codes)
automatic parallelization can work
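For concreteness, a minimal data-parallel kernel of the kind SIMD/vector hardware (or an auto-vectorizing compiler) handles well - a sketch in C, not taken from the slides:

    /* saxpy: y[i] = a*x[i] + y[i]; every iteration is independent, so the
     * same instruction can be applied across many data elements at once. */
    void saxpy(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)   /* regular, loop-oriented, no cross-iteration deps */
            y[i] = a * x[i] + y[i];   /* prime candidate for vectorization */
    }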


Multiple Instruction Multiple Data (MIMD)
• most flexible and of most interest to us
• has become the general-purpose computer
• automatic parallelization more difficult

Perfection: PRAM
[figure: processors connected through an interconnection network to main memory - no contention, unit latency]
parallel RAM - theoretical model
• fully connected - unit latency
• no contention, no need to exploit locality


Perfection not achievable
• latencies grow as the system size grows
• bandwidths restricted by memory organization and interconnect
• dealing with reality leads to a division between UMA and NUMA

UMA
[figure: processors connected through an interconnection network (long latency, contention in network) to main memory (contention in memory banks)]

UMA: uniform memory access
• latencies are the same
• but may be high
• data placement unimportant
• latency gets worse as the system grows
• contention restricts bandwidth
• typically used in small MPs only
• caches are often allowed in UMA systems

Caches
• another way of tackling latency/bandwidth problems
• holds recently used data
• BUT cache coherence problems => issues


NUMA: non-uniform memory access
[figure: each processor has a local memory (short latency); remote memories are reached through an interconnection network (long latency, contention in network)]

NUMA: non-uniform memory access
• latency low to local memory
• latency much higher to remote memories
• performance very sensitive to data placement
• bandwidth to local memory may be higher
• contention in network and for memories

NUMA Multiprocessors
shared memory
• one logical address space
• can be treated as shared memory
• use synchronization (e.g., locks) to access shared data
multicomputers (message passing)
• each processor has its own memory address space
• use message passing for communication

Clustered Systems
• small UMA nodes in large systems
• hybrid of sorts
• note: ambiguity of the term "cluster"


Cluster types
[figures: 64 processors (Proc. 0 .. Proc. 63) organized into clusters of 8, each cluster with its own cluster memory (Cluster Memory 0 .. 7), joined by a cluster interconnect; one variant adds a globally shared memory, the other has none]
• globally shared memory - Illinois Cedar
• no global memory - Stanford Dash, Wisconsin Typhoon

COMA: cache only memory architecture
[figure: processors, each with only a cache (short latency), connected by an interconnection network (long latency, contention in network) - no per-node main memory]
• effectively a form of NUMA
• caches only - causes data to migrate naturally

Writing Parallel Programs
Decomposition
• where is the parallelism
• break up work
Assignment
• who does what (think of data)
Orchestration
• synchronization, communication
Mapping
• which thread runs where (usually thru OS)
• (a small code sketch of these four steps on an array sum follows below)
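As mentioned above, here is a small shared-memory sketch of the four steps for summing an array with POSIX threads; the names (NUM_THREADS, partial[], worker) are illustrative assumptions, not from the slides:

    #include <pthread.h>

    #define NUM_THREADS 4
    #define N (1 << 20)

    static double a[N];
    static double partial[NUM_THREADS];      /* assignment: one chunk and one slot per thread */

    static void *worker(void *arg)
    {
        long id = (long)arg;                  /* mapping: the OS decides where this thread runs */
        long chunk = N / NUM_THREADS;         /* decomposition: split the iteration space */
        double s = 0.0;
        for (long i = id * chunk; i < (id + 1) * chunk; i++)
            s += a[i];
        partial[id] = s;                      /* orchestration: communicate through shared memory */
        return NULL;
    }

    double parallel_sum(void)
    {
        pthread_t t[NUM_THREADS];
        for (long id = 0; id < NUM_THREADS; id++)
            pthread_create(&t[id], NULL, worker, (void *)id);
        double total = 0.0;
        for (long id = 0; id < NUM_THREADS; id++) {
            pthread_join(t[id], NULL);        /* orchestration: wait before combining results */
            total += partial[id];
        }
        return total;
    }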


Process communication
• for most interesting parallel applications, parallel processes (tasks) need to communicate
• communication method leads to another division
• message passing
• shared memory

Shared memory vs message passing
shared memory
• programs use loads/stores
+ conceptually compatible with uniprocessors/small MPs
+ ease of programming if communication is complex/dynamic
+ lower latency communicating small data items
+ hardware controlled sharing allows automatic data motion

Shared memory vs message passing
message passing
• programs use sends/receives
+ simpler hardware
+ communication pattern explicit and precise
+ but they MUST be explicit and precise
+ least common denominator
• shared memory MP can emulate message passing easily
• biggest programming burden: managing communication artifacts

Shared memory vs message passing
message passing
• distribute data carefully to threads (no automatic data motion)
• partition data if possible, replicate if not
• replicate in s/w not automatic (so extra work to ensure ok)
• asynchronous or synchronous sends?
• coalesce small mesgs into large; gather separate data into one mesg to send and scatter the received mesg into separate data


Shared Mem

Thread1                          Thread2
compute (data)                   ...
store (A, B, C, D, ...)          synchronize
synchronize                      load (A, B, C, D, ...)
...                              compute

A B C D are the SAME in both threads - SINGLE shared memory

Mesg Passing

Thread1                          Thread2
compute (data)                   ...
store (A, B, C, D, ...)          receive (mesg)
gather (A B C D into mesg)       scatter (mesg to A B C D ...)
send (mesg)                      load (A, B, C, D, ...)
...                              compute

A B C D are DIFFERENT in each thread -- PRIVATE memory
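A minimal message-passing sketch of the Mesg Passing pattern above, written with MPI purely as a familiar send/receive API (the slides do not name a library); the ranks, tag, and packing into msg[] are illustrative:

    #include <mpi.h>

    void exchange(int rank)
    {
        double A = 0, B = 0, C = 0, D = 0;     /* A B C D are PRIVATE to each rank */
        double msg[4];

        if (rank == 0) {
            /* compute (data) ... then gather A B C D into one mesg */
            msg[0] = A; msg[1] = B; msg[2] = C; msg[3] = D;
            MPI_Send(msg, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* coalesced send */
        } else if (rank == 1) {
            MPI_Recv(msg, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            A = msg[0]; B = msg[1]; C = msg[2]; D = msg[3];       /* scatter mesg to A B C D */
            /* compute with the received copies */
        }
    }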

Eg: Sequential Ocean

Eg: Shared Memory Ocean


Eg: Mesg Passing Ocean

Process vs. thread
heavy-weight process
• separate PC, regs, stack
• different address space (page table, heap)
light-weight processes aka threads
• different PC, regs, stack ("context")
• same address space
sharing across heavy-weight processes possible via page table

Shared memory MPs: cache coherence
e.g.,
• proc 1 reads A
• proc 2 reads A
• proc 1 writes A
• now proc 2 has stale data regardless of write-thru/-back
informally - a method to make memory coherent despite caches
• caches can be viewed as large buffers
• with potentially very large delays
• replication/buffering + writes = coherence


Shared memory MPs: cache coherence
cache coherence suggests an absolute time scale
• not necessary
• what is required is the appearance of coherence
• not instantaneous coherence
• Bart Simpson's famous words -
• "nobody saw me do it so I didn't do it!"
• e.g. temporary incoherence between writeback cache and memory ok

Causes of Incoherence
sharing of writeable data
• cause most commonly considered
process migration
• can occur even if independent jobs are executing
I/O
• can be fixed by OS cache flushes

What is the software's expectation?
[figure: P1 does st A while P2 and P3 do ld A, with arbitrary slack between the threads]
• program expects threads to slip and slide
• assumes NOTHING about relative speeds
to understand coherence it is KEY to really understand the software's expectations

What is the software's expectation?
do invalidations have to be instantaneous?
can invalidations be delayed arbitrarily?
• yes if there is no synchronization (not "defined" yet)
• no if there is synchronization (complicated - later)
can invalidations to a SINGLE location applied to each cache be delayed arbitrarily?
• write atomicity, write serialization


Writes
Write atomicity
• either a write happens for EVERYone or not at all to ANYone
Write serialization
• writes to the SAME location from DIFFERENT caches appear in the SAME order to EVERYone
• ie not only does each write happen atomically for EVERYone but all those writes appear in the SAME order for EVERYone
• this order is the "bus grant order" for those writes
Above two requirements are "intuitive" now, will be "formalized" later

Cache coherence
the ONLY thing coherence provides is write atomicity
• ie when a write happens, it happens for all, so nobody can see the previous value
• atomicity achieves serialization
• even this suggests an instantaneous and global update of all copies, but it cannot be implemented that way
• writes take a non-zero time window to complete (ie not instantaneous); after this window nobody can see the previous value, and during this window any access to the write location is blocked
the ONLY thing coherence provides is write atomicity

Solutions to Coherence Problems
• disallow private caches - put caches next to memory
• make shared data non-cacheable - simplest software solution
• have software flush at strategic times - e.g., after critical sections
• use a global table - does not scale
• use bus-based snoopy caches - small scale, e.g., MULTICORES!
• use directories - large scale
• like a phone book
• e.g., SGI Origin

Bus-based Coherence
typically write-back caches are used to minimize traffic
typical states
• invalid
• shared
• exclusive
consider state changes for
• read hit, read miss
• write hit, write miss
snooping: ALL caches monitor ALL traffic on the bus


Bus-based Coherence
writes?
• invalidate copies in other caches
• => invalidation protocols
• OR
• update copies in other caches
• => update protocols

Snoopy Coherence
cache controller updates coherence states upon CPU accesses and bus requests

Simple Write Invalidate Protocol
state transitions for processor actions
[state diagram over Invalid, Shared (read only), Exclusive (read/write): CPU read miss - place read miss on bus, go to Shared; CPU write - place write miss on bus, go to Exclusive (write back the block first if needed); CPU read hit in Shared or Exclusive and CPU write hit in Exclusive stay put]

Simple Write Invalidate Protocol
state transitions for bus actions
[state diagram: write miss for this block - invalidate (Shared -> Invalid), or write back the block and invalidate (Exclusive -> Invalid); read miss for this block - write back the block and downgrade (Exclusive -> Shared)]
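A compact C sketch of the two diagrams (invalid/shared/exclusive, processor side and snoop side); this is a teaching approximation with the bus actions reduced to comments, not a full controller:

    enum state { INVALID, SHARED, EXCLUSIVE };

    /* processor-side transitions */
    enum state cpu_access(enum state s, int is_write)
    {
        if (!is_write) {                      /* CPU read */
            if (s == INVALID)                 /* read miss: place read miss on bus */
                return SHARED;
            return s;                         /* read hit in SHARED or EXCLUSIVE */
        }
        if (s != EXCLUSIVE) {                 /* write in INVALID or SHARED */
            /* place write miss on bus; write back the victim block if needed */
        }
        return EXCLUSIVE;                     /* CPU write ends up with ownership */
    }

    /* bus-side (snoop) transitions seen by OTHER caches holding the block */
    enum state snoop(enum state s, int bus_write_miss)
    {
        if (bus_write_miss)                   /* another cache wants to write */
            return INVALID;                   /* write back first if we were EXCLUSIVE */
        if (s == EXCLUSIVE)                   /* bus read miss for this block */
            return SHARED;                    /* write back and downgrade */
        return s;
    }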


So why is this hard?
I lied - in real snoopy protocols the transitions are not "atomic"
• bus requests/actions take arbitrary time (eg write miss -> bus acquire -> receive data -> complete), so transitions are split and there are many more "transition states"
• pipelined buses, so multiple requests in flight at the same time (from one cache and from multiple caches)
• so real protocols are non-atomic, with the need to nack others, let go of current requests, etc, with plenty of chance for deadlocks, livelocks, starvation

Bus-based Protocols: Performance
Misses: Mark Hill's 3C model
• capacity
• compulsory
• conflict
• coherence
coherence misses: additional misses due to the coherence protocol
as processors are added
• capacity misses decrease (total cache size increases)
• coherence misses increase (more communication)

Bus-based Protocols: Performance
as cache size is increased
• inherent coherence misses limit the overall miss rate
as block size is increased
• less effective than in a uniprocessor
• less spatial locality
• false sharing
• more bus traffic
• problem in a bus-based MP

Directory-based Cache Coherence
an alternative for large MPs
sharing info kept in a memory directory
directory states of a memory block
• shared - >= 1 processor has data and memory is up to date
• uncached - no processor has data
• exclusive - only one processor has data / memory is old
directory entry tracks which processors have data
• e.g., via a bit vector


Directory-based Cache Coherence

block in uncached state: memory current
• read miss - send data to requester; mark block shared; sharers <- requester
• write miss - send data to requester; mark block exclusive; sharers <- requester (owner)

block in shared state: memory current
• read miss - send data to requester; sharers +<- requester (add to the list)
• write miss - send data to requester; invalidate all sharers; mark block exclusive; sharers <- requester (owner)

Directory-based Cache Coherence
block in exclusive state: memory stale
• read miss - fetch data from owner; update memory; mark data shared; sharers +<- requester (add to prev owner)
• data write-back - mark block uncached; sharers <- empty
• write miss - fetch data from owner; mark block exclusive; sharers <- requester (new owner)
pretty much the SAME state machine as in snoopy

Directory-based Cache Coherence
consider 3 processors - P1, P2, P3
sequence of events
• data in no caches
• all 3 processors do a read
• P3 does a write
• C3 hits, but does not have ownership
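The read-miss/write-miss actions listed above, sketched as a directory handler in C; the bit-vector sharer list follows the slides, while the message sends are abbreviated to comments (names like dir_entry are assumptions, not from the slides):

    #include <stdint.h>

    enum dir_state { UNCACHED, SHARED_ST, EXCLUSIVE_ST };

    struct dir_entry {
        enum dir_state state;
        uint64_t sharers;                    /* bit i set => processor i holds a copy */
    };

    void dir_read_miss(struct dir_entry *e, int req)
    {
        if (e->state == EXCLUSIVE_ST) {
            /* fetch data from the owner and update memory */
        }
        /* send data to requester */
        e->state = SHARED_ST;
        e->sharers |= 1ULL << req;           /* add requester to the sharer list */
    }

    void dir_write_miss(struct dir_entry *e, int req)
    {
        if (e->state == SHARED_ST) {
            /* send invalidates to all sharers and collect acks */
        } else if (e->state == EXCLUSIVE_ST) {
            /* fetch data from the current owner and invalidate it */
        }
        /* send data and write permission to requester */
        e->state = EXCLUSIVE_ST;
        e->sharers = 1ULL << req;            /* requester is now the sole owner */
    }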


Directory-based Cache Coherence
• C3 makes a write request
• directory sends invalidates to C1 and C2
• C1 and C2 invalidate and ack
• directory receives acks; sets line exclusive; sends write permission to C3
• C3 sets line dirty; writes to cache; P3 proceeds
How are writes made atomic?
How are writes serialized?
• in the order of arrival at the directory

Performance
divide misses into local and remote
miss rate decreases as caches grow larger
• but coherence dampens the decrease in miss rate
latency effects
• cache hit: 1 cycle
• local miss: 25+ cycles
• remote miss (home): 75+ cycles
• remote miss (3-hop): 100+ cycles
Research: How do we make all misses look like hits or local misses?

So why is this hard?
In reality, transitions are not atomic
• like a real snoopy protocol but worse, because there is no single "bus grant order" to make sense of things
• correctness is guaranteed by serializing writes to one location at the directory and ordering replies at the requester
Need to reduce latency => reduce #hops => overlap some steps in the protocol (eg forward from the owner directly to the requestor instead of going first to the directory and then to the requestor)
• in general, more possibilities of deadlocks than snoopy

Communication and Synchronization
consider shared memory MPs
[figure: processors (P), each with a cache (C), connected by an interconnection network to memories (M)]
communication: exchange of data among processes
synchronization: special communication where the data is control info
• e.g., critical section


Types of synchronization
mutual exclusion - allow only one thread in a critical section
• enter the critical section only if the lock is available, else wait
• after finishing - unlock to let the next thread in
• eg ticket reservation, Ocean (order unimportant)
producer-consumer communication (producer BEFORE consumer)
• use locks to do signal-wait (666 or other courses)
barrier (global synchronization for all or a subset of threads)
• threads wait at the barrier till all threads reach the barrier - Ocean
• use locks to count threads at the barrier (666)

Synchronization
Synchronization can be done with ordinary loads and stores

Thread 1:                          Thread 2:
flag1 = 0                          flag2 = 0
---                                ---
flag1 = 1                          flag2 = 1
lab1: if (flag2 == 1) goto lab1    lab2: if (flag1 == 1) goto lab2
(critical section)                 (critical section)
---                                ---
flag1 = 0                          flag2 = 0

BUT difficult to implement/debug (this can deadlock!)

Synchronization Primitives
Let's look at locks -- we can build other synchronization using locks

test&set (lock) {
    temp := lock;
    lock := 1;
    return (temp);
}

reset (lock) {
    lock := 0;
}
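The same test&set/reset pair written with C11 atomics (a sketch; atomic_exchange supplies the indivisible read-then-write that the pseudocode assumes):

    #include <stdatomic.h>

    typedef atomic_int lock_t;

    int test_and_set(lock_t *lock)
    {
        return atomic_exchange(lock, 1);     /* temp := lock; lock := 1; return temp */
    }

    void reset(lock_t *lock)
    {
        atomic_store(lock, 0);               /* lock := 0 */
    }

A caller spins with while (test_and_set(&l) == 1); exactly as the next slide shows.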


Synchronization Primitives
software typically loops until lock = 0 is returned (spin lock)
• while (test&set(a_lock) == 1) ; /* wait */
a lock is a variable (memory location)
• program associates locks with data structures to protect the data structures, but the lock address and the data structure address are unrelated - the relationship is unknown to hardware and is in the mind of the programmer
how might this be implemented in hardware?
• KEY - in test&set, the first two instructions should be atomic
• ie we need an atomic swap of temp and lock

Example Synchronization Primitives
locks need a read and a write to be atomic, but it is hard for hardware to do TWO operations indivisibly (eg disallow other accesses in between => 2 threads locking at about the same time will deadlock!)
Solution: use two instructions but check if atomicity was violated
load linked + store conditional - two instructions
• load linked reads the value and sets a global address register
• store conditional writes the value if the global address is "unchanged"
• any change => write from another CPU => invalidate global address => global address register invalid => store conditional fails

Example Synchronization Primitives
• e.g., atomic exchange with ll/sc:

try:    mov r3, r4      # move exchange value
        ll  r2, 0(r1)   # load linked
        sc  r3, 0(r1)   # store conditional
        beqz r3, try    # if store fails
        mov r4, r2      # load value to r4

User-level Synchronization
spin locks:

        li r2, #1            # r1 has lock address
lockit: exchg r2, 0(r1)      # atomic exchange
        bnez r2, lockit      # if locked

with coherence, exchange => writes => lots of traffic
• key problem - checking and writing are fused together
• alternative: separate checking and writing
• spin on a check for unlock only; if unlocked, try to lock
• also called "test and test and set"


test & test & set

        li r2, #1
lockit: ld r3, 0(r1)
        bnez r3, lockit      # no writes => no traffic => cache hits
        exchg r2, 0(r1)      # atomic exchange
        bnez r2, lockit      # if locked

if there is heavy lock contention, even test&test&set may have lots of traffic; queue-based locks fix this (666)
barriers and producer-consumer are implemented in software using basic locks (666/563)

Memory ordering and Consistency
Coherence gives write atomicity & serialization
• for ONE location - across threads
What about accesses to MULTIPLE locations from ONE thread -- when should the coherence action for each access complete?
• sequentially -- too slow (OoO CPU, lockup-free caches)
• overlapped -- fast, but there are DEEP correctness issues
• But why - accesses to DIFFERENT locations are independent (as per Ch. 3), so why are there issues?
Issues so deep that they get new topics: ordering and consistency
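A C11 sketch of the test&test&set idea above: spin on an ordinary load (which hits in the cache and generates no bus traffic) and attempt the atomic exchange only when the lock looks free:

    #include <stdatomic.h>

    void tts_acquire(atomic_int *lock)
    {
        for (;;) {
            while (atomic_load(lock) != 0)        /* test: read-only spin, cache hits */
                ;
            if (atomic_exchange(lock, 1) == 0)    /* test&set: try to grab the lock */
                return;
            /* lost the race; go back to read-only spinning */
        }
    }

    void tts_release(atomic_int *lock)
    {
        atomic_store(lock, 0);
    }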

Memory Ordering
You may think coherence actions should complete at synchronization points

thread1              thread2
st X                 lock B_lock (test&set)
ld Y                 ld B
                     st B
                     unlock B_lock (unlock = st)

• synchronization is done to see "updated" values, so coherence should complete at synch to make new values visible
• complete the st X and st B invalidations at lock/unlock

Memory Ordering - Eg1
The previous slide assumes that synchronization is clearly marked so it is hardware-visible (eg test&set uses ll and sc)
But that may not be true - look at this GENERIC example
• A, flag = 0

thread1              thread2
A = 1                while (!flag);   # wait
flag = 1             print A

• thread1 has 2 stores which are not re-ordered as per Ch. 2 (even if to different addresses - stores complete in program order), but such re-ordering can happen in the interconnect
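One way to make the flag/A relationship visible to the hardware and compiler is shown below with C11 release/acquire atomics; this is an illustration of "marking" the synchronization, not something the slides prescribe:

    #include <stdatomic.h>

    int A = 0;
    atomic_int flag = 0;

    void thread1(void)
    {
        A = 1;                                                  /* data */
        atomic_store_explicit(&flag, 1, memory_order_release); /* publish: the store to A is ordered before this */
    }

    void thread2(void)
    {
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;                                                   /* wait */
        /* the acquire load orders the read of A after it, so A == 1 here */
    }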


Memory Ordering - Eg2
• A = 0, B = 0

Proc 1               Proc 2
A = 1                B = 1
L1: print B          L2: print A

intuitively it is impossible for BOTH A and B to be 0
but it can happen if memory operations are REordered

Reasons for inconsistency
memory updates may become re-ordered by the memory system
[figure: P1 and P2 on a shared bus / interconnection net, with Flag: and A: held in different memory modules]

Memory Ordering - Eg3

Time in an MP is relative
• there is no system-wide, totally ordered time scale
• time is only a partial order
consider possible interleavings
• A, B, C = 0

P1              P2              P3
a: A = 1        c: B = 1        e: C = 1
b: print BC     d: print AC     f: print AB

we can observe the actual sequence indirectly via the printed values
if each stream is executed in order, a set of interleavings is possible
• acbdef, ecdfab, etc. => (10, 10, 11), (11, 01, 01)
if instructions execute out of order, more interleavings are possible
• e.g., adbcfe => (00, 10, 11)


Memory Ordering - Eg3
each interleaving implies a specific set of printed values
• if memory is atomic
if memory is not atomic
• different processors may perceive different orderings
• eg (01, 10, 01) => P1: e->c, P2: a->e, P3: c->a
• a cycle, which is impossible intuitively
• can cause correctness problems in MPs
• synchronization will break - if a lock acquire is seen by some but not all threads, then more than one thread may get the lock!

Memory Ordering
Why is this happening? Because the addresses are different (Eg1 - flag and A), so hardware has no way to know their semantic relationship - that flag "protects" A (which is in the programmer's mind)
• KEY -- the problem occurs EVEN if writes are atomic/serialized
• Ch. 2 prohibits reordering of accesses to the same address
• Ch. 4 prohibits reordering of accesses to different addresses!
So are we doomed? NO!
• Answer 1: tell hardware when not to reorder
• different choices of "when" - different memory models
• Answer 2: don't tell hardware, but reorder speculatively


Formally defines what ordering is enforced by hardware/compiler Intel x86, Sun Sparc, Dec Alpha - all different models! • allows programmer to know what to expect • programs assuming “stricter” model may not run correctly • eg loads are re-ordered wrt stores but not other loads on machines implementing “looser” models • eg loads are re-ordered wrt other loads and stores • stricter-model program will not have enough membars to reordering in looser-model machine If programmer wants no reordering, use a special instruction looser models generically called relaxed models • in general called a “memory barrier” instruction (membar) Relaxed models hard to program (but better performance) • all instrs before membar complete and nothing after started • sounds familiar? Strictest model - easiest to program - called sequential consistency • all memory ops done in program order and atomically


Sequential Consistency
[figure: processors P1 .. P5 all sharing a single MEMORY]
"a system is sequentially consistent if the result of ANY execution is the same as if the operations of all processors were executed in SOME sequential order, and the operations of each individual processor appear in this sequence in the order specified by the program"

Speculation for Seq Consistency
Would Seq Consistency perform poorly? No - use speculation
• allow loads to go out of program order (loads are critical for performance)
• if there is an invalidation between the time the load completes and the time the load commits, then the load may have seen an incorrect value ==> squash and re-execute the load and later instructions
• if there is no invalidation, nobody saw me do it so I didn't do it
• preserves Ch. 2 and Ch. 4 (OoO issue, lockup-free caches, banking, ..)

Coherence and Consistency
related but not the same
• implementation issues put the two together
• after 565 you should NEVER use one word for the other!
coherence
• deals with ONE location across threads
• answers WHAT is kept coherent
consistency (a problem even if there are no caches or coherence is correct)
• deals with ordering of MULTIPLE locations in one thread
• answers BY WHEN coherence actions should be complete


Atomicity, Memory Consistency, Ordering
atomicity
• informally - accesses to a variable occur one at a time and "update" ALL copies
sequential consistency
• informally - memory is a "single time-shared block"
relaxed consistency
• informally - only synch operations are consistent; high perf
ordering of events
• underlying implementation feature - sequential/relaxed

Atomicity
define
• an access by a processor i on var X is atomic if no other processor is allowed to access any copy of X while the access by i is in progress
hardware atomicity is not necessarily required
• only APPEAR to be atomic without literally updating ALL copies simultaneously
atomicity is key to synchronization operations
ordering problems still possible - atomicity refers to individual variables
• NOT ordering of updates to different variables

Coherence revisited
informally, makes writes to one location appear in the same order to other processors
• focuses on a single location rather than ordering among locations
• caches as large buffers
• with large delays
coherence is the mechanism to provide the appearance of write atomicity

Consistency models
model 1: sequentially consistent - events are strongly ordered
• chapter 3 invalid? NO - only APPEAR strongly ordered
• but may be low performance (fixed with speculation!)
model 2: relaxed consistency - relax program order/atomicity
• require order only for synchronization operations
• non-synch and synchronizing operations may look alike
• programmer marks synchronization
• no reordering across synch (in compiler or hardware)
• higher performance


Future

Future of computer systems
• higher performance
• but wiring limitations
• power limitations
• memory system limitations
• multiprocessors
• billion transistor chips
• better processors, better caches, smarter software, ...
• the next decade will be more interesting than the previous
