CS 258 Parallel Computer Architecture Lecture 2
Convergence of Parallel Architectures
January 28, 2008
Prof. John D. Kubiatowicz
http://www.cs.berkeley.edu/~kubitron/cs258

Review
• Industry has decided that multiprocessing is the future/best use of transistors
  – Every major chip manufacturer is now making multicore chips
• History of microprocessor architecture is parallelism
  – translates area and density into performance
• The future is higher levels of parallelism
  – Parallel architecture concepts apply at many levels
  – Communication is also on an exponential curve
• Proper way to compute speedup
  – Incorrect way to measure:
    » Compare the parallel program on 1 processor to the parallel program on p processors
  – Instead:
    » Compare the uniprocessor program on 1 processor to the parallel program on p processors

History
• Parallel architectures tied closely to programming models
  – Divergent architectures, with no predictable pattern of growth
  – Mid-80s renaissance
(Figure: application and system software layered over divergent architectures: systolic arrays, SIMD, message passing, dataflow, shared memory)

Plan for Today
• Look at major programming models
  – Where did they come from?
  – The 80s architectural renaissance!
  – What do they provide?
  – How have they converged?
• Extract general structure and fundamental issues
(Figure: systolic arrays, SIMD, message passing, dataflow, and shared memory converging toward a generic architecture)

Programming Model
• Conceptualization of the machine that the programmer uses in coding applications
  – How parts cooperate and coordinate their activities
  – Specifies communication and synchronization operations
• Multiprogramming
  – no communication or synchronization at the program level
• Shared address space
  – like a bulletin board
• Message passing
  – like letters or phone calls, explicit point to point
• Data parallel
  – more regimented, global actions on data
  – implemented with shared address space or message passing

Shared Memory ⇒ Shared Addr. Space
(Figure: many processors attached to a single shared memory)
• Range of addresses shared by all processors
  – All communication is implicit (through memory)
  – Want to communicate a bunch of info? Pass a pointer.
• Programming is “straightforward”
  – Generalization of multithreaded programming (a minimal threads sketch appears below, after “Adding Processing Capacity”)

Historical Development
(Figure: crossbar-based mainframe organization vs. bus-based minicomputer organization)
• “Mainframe” approach
  – Motivated by multiprogramming
  – Extends crossbar used for memory and I/O
  – Processor cost-limited => crossbar
  – Bandwidth scales with p
  – High incremental cost
    » use multistage interconnect instead
• “Minicomputer” approach
  – Almost all microprocessor systems have a bus
  – Motivated by multiprogramming, transaction processing (TP)
  – Used heavily for parallel computing
  – Called symmetric multiprocessor (SMP)
  – Latency larger than for a uniprocessor
  – Bus is the bandwidth bottleneck
    » caching is key: coherence problem
  – Low incremental cost

Adding Processing Capacity
(Figure: memory modules, I/O controllers and devices, and processors on a shared interconnect)
• Memory capacity increased by adding modules
• I/O by controllers and devices
• Add processors for processing!
  – For higher-throughput multiprogramming, or parallel programs
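
The shared-address-space model above can be made concrete with a small sketch. This is an illustration, not part of the original slides; it assumes POSIX threads, and the names (producer, consumer, shared_ptr) are hypothetical. Communication happens through ordinary loads and stores, and "communicating a bunch of info" really is just passing a pointer; only the synchronization uses special operations (here a mutex and condition variable).

/* Illustrative sketch (not from the slides): shared-address-space
 * communication with POSIX threads.  The producer fills a buffer and
 * "communicates" it by publishing a pointer; the data itself moves through
 * ordinary loads and stores to shared memory.  Compile with -lpthread. */
#include <pthread.h>
#include <stdio.h>

#define N 4

static int buffer[N];                  /* lives in the shared address space */
static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static int *shared_ptr = NULL;         /* "want to communicate info? pass a pointer" */

static void *producer(void *arg) {
    for (int i = 0; i < N; i++) buffer[i] = i * i;   /* ordinary stores */
    pthread_mutex_lock(&lock);
    shared_ptr = buffer;               /* publish the pointer              */
    pthread_cond_signal(&ready);       /* synchronization is the special op */
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *consumer(void *arg) {
    pthread_mutex_lock(&lock);
    while (shared_ptr == NULL)         /* wait until the pointer is published */
        pthread_cond_wait(&ready, &lock);
    pthread_mutex_unlock(&lock);
    for (int i = 0; i < N; i++)        /* ordinary loads read the same memory */
        printf("%d ", shared_ptr[i]);
    printf("\n");
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}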

Shared Physical Memory
• Any processor can directly reference any location
  – Communication operation is load/store
  – Special operations for synchronization
• Any I/O controller can reach any memory
• Operating system can run on any processor, or all
  – OS uses shared memory to coordinate
• What about application processes?

Shared Virtual Address Space
• Process = address space plus thread of control
• Virtual-to-physical mapping can be established so that processes share portions of the address space
  – User/kernel or multiple processes
• Multiple threads of control in one address space
  – Popular approach to structuring OS’s
  – Now standard application capability (ex: POSIX threads)
• Writes to shared addresses visible to other threads
  – Natural extension of the uniprocessor model
  – conventional memory operations for communication
  – special atomic operations for synchronization
    » also load/stores

Structured Shared Address Space
(Figure: virtual address spaces of several processes, each with a private portion and a shared portion mapped onto common physical addresses)
• Ad hoc parallelism used in system code
• Most parallel applications have a structured SAS
• Same program on each processor
  – shared variable X means the same thing to each thread

Cache Coherence Problem
(Figure: four processors with caches; one writes a shared location while others read possibly stale cached copies; write-through? miss?)
• Caches are aliases for memory locations
• Does every processor eventually see the new value?
• Tightly related: cache consistency
  – In what order do writes appear to other processors?
• Buses make this easy: every processor can snoop on every write
  – Essential feature: broadcast
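
The coherence question above ("does every processor eventually see the new value?") is the guarantee the hardware must provide for patterns like the following. This is an illustrative sketch, not from the slides; it assumes C11 atomics and POSIX threads, and the names are hypothetical. On a bus-based SMP, snooping on every write is what lets the spinning reader eventually observe the writer's store; the release/acquire pair only handles ordering at the language level.

/* Illustrative sketch (not from the slides) of the pattern coherence
 * hardware must support: one thread writes data and raises a flag, another
 * spins until it sees the flag and then reads the data.  Whether the
 * reader's cached copy of `flag` ever gets updated is exactly the
 * cache-coherence question; snooping caches answer it by observing
 * (and invalidating or updating on) every bus write. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int data = 0;                   /* plain shared location     */
static atomic_int flag = 0;            /* written by P0, read by P1 */

static void *writer(void *arg) {
    data = 42;                         /* the "new value"           */
    atomic_store_explicit(&flag, 1, memory_order_release);
    return NULL;
}

static void *reader(void *arg) {
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                              /* relies on eventually seeing the write */
    printf("reader sees data = %d\n", data);   /* must be 42 on a coherent machine */
    return NULL;
}

int main(void) {
    pthread_t w, r;
    pthread_create(&r, NULL, reader, NULL);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}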

Engineering: Intel Pentium Pro Quad
(Figure: four P-Pro modules, each with CPU, interrupt controller, 256-KB L2 $, and bus interface, on the P-Pro bus (64-bit data, 36-bit address, 66 MHz); PCI bridges, memory controller, MIU, and 1-, 2-, or 4-way interleaved DRAM)
– All coherence and multiprocessing glue in the processor module
– Highly integrated, targeted at high volume
– Low latency and bandwidth

Engineering: SUN Enterprise
(Figure: CPU/mem cards (two processors with $/$2 caches plus memory) and I/O cards (SBUS, 100bT, SCSI, 2 FiberChannel) on the Gigaplane bus (256 data, 41 address, 83 MHz))
• Proc + mem card and I/O card
  – 16 cards of either type
  – All memory accessed over the bus, so symmetric
  – Higher bandwidth, higher latency bus

Quad-Processor Xeon Architecture
• All sharing through pairs of front-side busses (FSBs)
  – Memory traffic/cache misses go through a single chipset to memory
  – Example: “Blackford” chipset

Scaling Up
(Figure: “dance hall” organization, with processors and caches on one side of an omega network and memories on the other, vs. distributed memory, where each node pairs a processor with local memory on a general network)
– Problem is interconnect: cost (crossbar) or bandwidth (bus)
– Dance hall: bandwidth still scalable, but lower cost than crossbar
  » latencies to memory uniform, but uniformly large
– Distributed memory or non-uniform memory access (NUMA)
  » Construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
– Caching shared (particularly nonlocal) data?

Stanford DASH
(Figure: cluster of four processors, each with its own L1 $, sharing an L2 cache)
• Clusters of 4 processors share a 2nd-level cache
• Up to 16 clusters tied together with a 2-dimensional mesh
• 16-bit directory associated with every memory line
  – Each memory line has a home cluster that contains its DRAM
  – The 16-bit vector says which clusters (if any) have read copies
  – Only one writer permitted at a time
• Never got more than 12 clusters (48 processors) working at one time: asynchronous network problems!
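
As a rough illustration of DASH's 16-bit directory, each memory line's entry can be modeled as a presence bit vector (one bit per cluster) plus an exclusive-owner field. This is my own sketch, not from the slides; the state transitions below are simplified assumptions, not the actual DASH protocol.

/* Illustrative sketch (not the actual DASH protocol): a per-line directory
 * entry with a 16-bit presence vector of clusters holding read copies and
 * at most one cluster holding the line exclusively (the single writer). */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint16_t sharers;   /* bit i set => cluster i has a read copy                */
    int8_t   owner;     /* cluster with the exclusive (writable) copy, -1 if none */
} dir_entry_t;

/* A reader joins the sharer set (any exclusive copy is assumed downgraded). */
static void dir_add_reader(dir_entry_t *e, int cluster) {
    e->owner = -1;                            /* writer, if any, gives up exclusivity */
    e->sharers |= (uint16_t)(1u << cluster);
}

/* A writer must invalidate all other read copies and become the single owner. */
static uint16_t dir_grant_write(dir_entry_t *e, int cluster) {
    uint16_t to_invalidate = e->sharers & (uint16_t)~(1u << cluster);
    e->sharers = 0;
    e->owner = (int8_t)cluster;
    return to_invalidate;                     /* bit mask of clusters to invalidate */
}

int main(void) {
    dir_entry_t line = { .sharers = 0, .owner = -1 };
    dir_add_reader(&line, 3);
    dir_add_reader(&line, 7);
    uint16_t inval = dir_grant_write(&line, 3);
    printf("invalidate mask = 0x%04x, owner = %d\n", (unsigned)inval, line.owner);
    return 0;   /* prints: invalidate mask = 0x0080, owner = 3 */
}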

The MIT Alewife Multiprocessor
• Cache-coherent shared memory
  – Partially in software!
  – Limited directory + software overflow
• User-level message passing
• Rapid context switching
• 2-dimensional asynchronous network
• One node per board
• Got 32 processors (+ I/O boards) working

Engineering: Cray T3E
(Figure: node with processor, $, memory, and a combined memory controller/network interface feeding an X/Y/Z switch; external I/O)
– Scales up to 1024 processors, 480 MB/s links
– Memory controller generates a request message for non-local references
– No hardware mechanism for coherence
  » SGI Origin etc. provide this

AMD Direct Connect
• Communication over a general interconnect
  – Shared memory/address space traffic over the network
  – I/O traffic to memory over the network
  – Multiple topology options (seems to scale to 8 or 16 processor chips)

What is underlying Shared Memory??
(Figure: systolic arrays, SIMD, message passing, dataflow, and shared memory all mapped onto a generic architecture of processor + cache + memory nodes on a network)
• Packet-switched networks better utilize available link bandwidth than circuit-switched networks
• So, the network passes messages around!

Message Passing Architectures
(Figure: complete computers (processor, cache, memory) connected by a network)
• Complete computer as building block, including I/O
  – Communication via explicit I/O operations
• Programming model
  – direct access only to private address space (local memory)
  – communication via explicit messages (send/receive); see the sketch at the end of this section
• High-level block diagram
  – Communication integration?
    » Mem, I/O, LAN, Cluster
  – Easier to build and scale than SAS
• Programming model more removed from basic hardware operations
  – Library or OS intervention

Message-Passing Abstraction
(Figure: a send operation matched with a receive)

Evolution of Message-Passing Machines
• Early machines: FIFO on each link
  – HW close to prog. model; …
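
A minimal sketch of the explicit send/receive model described in the "Message Passing Architectures" slide, written against the standard MPI interface. MPI is used here only as a familiar illustration; it is not mentioned in the slides. Each process sees only its private address space, and data moves only through the matched send and receive.

/* Illustrative sketch (not from the slides) of explicit message passing:
 * each process has only a private address space; rank 0 sends an array to
 * rank 1, which must issue a matching receive.  Requires an MPI
 * implementation (e.g. compile with mpicc, run with mpirun -np 2). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data[4] = {0};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < 4; i++) data[i] = i * i;   /* local memory only */
        MPI_Send(data, 4, MPI_INT, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(data, 4, MPI_INT, /*source=*/0, /*tag=*/0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                   /* matching receive */
        printf("rank 1 received: %d %d %d %d\n",
               data[0], data[1], data[2], data[3]);
    }

    MPI_Finalize();
    return 0;
}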