
CS 258: Parallel Computer Architecture
Lecture 2 – Convergence of Parallel Architectures
January 28, 2008
Prof. John D. Kubiatowicz
http://www.cs.berkeley.edu/~kubitron/cs258

Review
• Industry has decided that multiprocessing is the future/best use of transistors
  – Every major chip manufacturer now making MultiCore chips
• History of architecture is parallelism
  – translates area and density into performance
• The Future is higher levels of parallelism
  – Parallel Architecture concepts apply at many levels
  – Communication also on exponential curve
• Proper way to compute speedup
  – Incorrect way to measure:
    » Compare parallel program on 1 processor to parallel program on p processors
  – Instead:
    » Should compare uniprocessor program on 1 processor to parallel program on p processors
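As a worked restatement of the speedup point above (the formulas are the standard definitions, not from the original slide, written in the deck's plain notation):

    Speedup_true(p)          = T_best-uniprocessor-program / T_parallel(p)
    Speedup_self-relative(p) = T_parallel(1) / T_parallel(p)

The second ratio flatters the parallel code, since T_parallel(1) carries parallelization overheads that the best sequential program does not; the slide's "incorrect way to measure" is exactly this self-relative speedup.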


History
• Parallel architectures tied closely to programming models
  – Divergent architectures, with no predictable pattern of growth
  – Mid 80s renaissance

Plan for Today
• Look at major programming models
  – where did they come from?
  – The 80s architectural renaissance!
  – What do they provide?
  – How have they converged?
• Extract general structure and fundamental issues

[Figure: Application Software and System Software sitting above a Generic Architecture into which Systolic Arrays, SIMD, Message Passing, Dataflow, and Shared Memory architectures converge]

Programming Model
• Conceptualization of the machine that programmer uses in coding applications
  – How parts cooperate and coordinate their activities
  – Specifies communication and synchronization operations
• Multiprogramming
  – no communication or synch. at program level
• Shared address space
  – like bulletin board
• Message passing
  – like letters or phone calls, explicit point to point
• Data parallel:
  – more regimented, global actions on data
  – Implemented with shared address space or message passing

Shared Memory ⇒ Shared Addr. Space
[Figure: several processors all connected to one shared memory]
• Range of addresses shared by all processors
  – All communication is implicit (through memory)
  – Want to communicate a bunch of info? Pass pointer.
• Programming is “straightforward”
  – Generalization of multithreaded programming
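A minimal sketch of the “pass a pointer” idea in C with POSIX threads (an illustration added here, not from the slides; the buffer name and sizes are made up). One thread produces data in memory and another reads it through the same address; pthread_join supplies the ordering:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000
    static double *shared_buf;     /* same address names the same data in every thread */

    static void *producer(void *arg) {
        shared_buf = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) shared_buf[i] = 0.5 * i;
        return NULL;               /* nothing is "sent": the data is already shared */
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        pthread_join(t, NULL);     /* join orders producer's writes before the read below */
        printf("shared_buf[42] = %f\n", shared_buf[42]);   /* communication was implicit */
        free(shared_buf);
        return 0;
    }

Compile with something like cc -pthread; the point is that no explicit send/receive appears anywhere: ordinary loads and stores do the communication.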

Historical Development
• “Mainframe” approach
  – Motivated by multiprogramming
  – Extends crossbar used for Mem and I/O
  – Processor cost-limited => crossbar
  – Bandwidth scales with p
  – High incremental cost
    » use multistage instead
• “Minicomputer” approach
  – Almost all microprocessor systems have bus
  – Motivated by multiprogramming, TP
  – Used heavily for I/O
  – Called symmetric multiprocessor (SMP)
  – Latency larger than for uniprocessor
  – Bus is bandwidth bottleneck
    » caching is key: coherence problem
  – Low incremental cost
[Figure: crossbar and multistage interconnects joining P's, Mem's, and I/O ctrl's; bus-based SMP with a cache ($) in front of each P]

Adding Processing Capacity
[Figure: processors, memory modules, and I/O controllers/devices attached to an interconnect]
• Memory capacity increased by adding modules
• I/O by controllers and devices
• Add processors for processing!
  – For higher-throughput multiprogramming, or parallel programs

Shared Physical Memory
• Any processor can directly reference any location
  – Communication operation is load/store
  – Special operations for synchronization
• Any I/O controller - any memory
• Operating system can run on any processor, or all
  – OS uses shared memory to coordinate
• What about application processes?

Shared Virtual Address Space
• Process = address space plus thread(s) of control
• Virtual-to-physical mapping can be established so that processes share portions of address space
  – User-kernel or multiple processes
• Multiple threads of control on one address space
  – Popular approach to structuring OS’s
  – Now standard application capability (ex: POSIX threads)
• Writes to shared address visible to other threads
  – Natural extension of uniprocessor model
  – conventional memory operations for communication
  – special atomic operations for synchronization
    » also load/stores
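A small sketch of the split between “conventional memory operations for communication” and “special atomic operations for synchronization,” using POSIX threads and C11 atomics (illustration only; the thread and iteration counts are arbitrary):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NTHREADS 4
    static atomic_long counter;            /* lives in the shared address space */

    static void *worker(void *arg) {
        for (int i = 0; i < 100000; i++)
            atomic_fetch_add(&counter, 1); /* atomic read-modify-write; a plain += would race */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
        printf("counter = %ld (expect %d)\n",
               atomic_load(&counter), NTHREADS * 100000);
        return 0;
    }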


Structured Shared Address Space
[Figure: virtual address spaces of processes P0..Pn, each with a private portion and a shared portion; the shared portions map to common physical addresses in the machine physical address space]
• Ad hoc parallelism used in system code
• Most parallel applications have structured SAS
• Same program on each processor
  – shared variable X means the same thing to each thread

Problem
[Figure: Pn stores a new value into location 4 (write-through?); copies of location 4 sit in several caches ($4), so other processors' loads may miss or read a stale copy]
• Caches are aliases for memory locations
• Does every processor eventually see new value?
• Tightly related: Cache Consistency
  – In what order do writes appear to other processors?
• Buses make this easy: every processor can snoop on every write
  – Essential feature: Broadcast
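A software-level analogue of the visibility and ordering questions above (a sketch added for illustration, not from the slides): one thread writes data and then a flag; whether a reader that sees the flag also sees the data is precisely an ordering question, made explicit here with C11 release/acquire atomics:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int data;                     /* ordinary shared location */
    static atomic_int flag;              /* publication flag */

    static void *writer(void *arg) {
        data = 42;                                              /* write #1 */
        atomic_store_explicit(&flag, 1, memory_order_release);  /* write #2 */
        return NULL;
    }

    static void *reader(void *arg) {
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;                            /* wait until write #2 becomes visible */
        printf("data = %d\n", data);     /* release/acquire guarantees 42 here */
        return NULL;
    }

    int main(void) {
        pthread_t w, r;
        pthread_create(&w, NULL, writer, NULL);
        pthread_create(&r, NULL, reader, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
    }

The hardware questions on the slide (snooping, broadcast) are about making such guarantees cheap to provide.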

Engineering: Pentium Pro Quad
[Figure: four P-Pro modules (CPU, interrupt controller, 256-KB L2 $, bus interface) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), with memory controller, MIU, 1-, 2-, or 4-way interleaved DRAM, and two PCI bridges with PCI buses and I/O cards]
– All coherence and multiprocessing glue in processor module
– Highly integrated, targeted at high volume
– Low latency and bandwidth

Engineering: SUN Enterprise
[Figure: CPU/mem cards (two processors, each with $ and $2, plus mem ctrl) and I/O cards (bus interface, SBUS slots, 100bT, SCSI, 2 FiberChannel) on the Gigaplane bus (256 data, 41 address, 83 MHz)]
• Proc + mem card - I/O card
  – 16 cards of either type
  – All memory accessed over bus, so symmetric
  – Higher bandwidth, higher latency bus


Quad-Processor Xeon Architecture
• All sharing through pairs of front side busses (FSB)
  – Memory traffic/cache misses through single chipset to memory
  – Example “Blackford” chipset

Scaling Up
[Figure: “Dance hall” organization (processors with caches on one side of an Omega or general network, memory modules M on the other) vs. distributed memory (each node has P, $, and M on a general network)]
– Problem is interconnect: cost (crossbar) or bandwidth (bus)
– Dance-hall: bandwidth still scalable, but lower cost than crossbar
  » latencies to memory uniform, but uniformly large
– Distributed memory or non-uniform memory access (NUMA)
  » Construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
– Caching shared (particularly nonlocal) data?

Stanford DASH
[Figure: cluster of 4 processors, each with an L1-$, sharing an L2-Cache and memory]
• Clusters of 4 processors share 2nd-level cache
• Up to 16 clusters tied together with 2-dim mesh
• 16-bit directory associated with every memory line
  – Each memory line has home cluster that contains DRAM
  – The 16-bit vector says which clusters (if any) have read copies
  – Only one writer permitted at a time
• Never got more than 12 clusters (48 processors) working at one time: Asynchronous network probs!

The MIT Alewife Multiprocessor
• Cache-coherence
  – Partially in Software!
  – Limited Directory + software overflow
• User-level Message-Passing
• Rapid Context-Switching
• 2-dimensional Asynchronous network
• One node/board
• Got 32 processors (+ I/O boards) working

Engineering: Cray T3E
[Figure: node with P, $, Mem, and a combined mem ctrl and NI, attached to the 3-D interconnect through an XY switch and Z links; external I/O]
– Scale up to 1024 processors, 480MB/s links
– Memory controller generates request message for non-local references
– No hardware mechanism for coherence
  » SGI Origin etc. provide this

AMD Direct Connect
• Communication over general interconnect
  – Shared memory/address space traffic over network
  – I/O traffic to memory over network
  – Multiple topology options (seems to scale to 8 or 16 processor chips)

What is underlying Shared Memory??
[Figure: generic architecture (nodes of P, $, M on a network) as the convergence point of Systolic Arrays, SIMD, Message Passing, Dataflow, and Shared Memory]
• Packet switched networks better utilize available link bandwidth than circuit switched networks
• So, network passes messages around!

Message Passing Architectures
[Figure: complete computers (P, $, M) connected by a network]
• Complete computer as building block, including I/O
  – Communication via explicit I/O operations
• Programming model
  – direct access only to private address space (local memory)
  – communication via explicit messages (send/receive)
• High-level block diagram
  – Communication integration?
    » Mem, I/O, LAN, Cluster
  – Easier to build and scale than SAS
• Programming model more removed from basic hardware operations
  – Library or OS intervention


Message-Passing Abstraction
[Figure: Process P executes Send X, Q, t (buffer at address X, destination Q, tag t); Process Q executes a matching Receive Y, P, t (storage at address Y, source P, tag t); the matched pair copies data between the two local address spaces]
– Send specifies buffer to be transmitted and receiving process
– Recv specifies sending process and application storage to receive into
– Memory to memory copy, but need to name processes
– Optional tag on send and matching rule on receive
– User process names local data and entities in process/tag space too
– In simplest form, the send/recv match achieves pairwise synch event
  » Other variants too
– Many overheads: copying, buffer management, protection

Evolution of Message-Passing Machines
• Early machines: FIFO on each link
  – HW close to prog. model; synchronous ops
  – topology central (hypercube)
[Figure: hypercube with nodes labeled 000–111 (Seitz, CACM Jan 95)]
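The send/recv abstraction above, written concretely in C with MPI (MPI is used here only as a familiar illustration; the slide describes the abstraction generically). Process 0 names the receiver and a tag; process 1 names the sender and the same tag, which is the matching rule:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, x = 0;
        const int TAG = 99;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            x = 42;
            MPI_Send(&x, 1, MPI_INT, 1, TAG, MPI_COMM_WORLD);      /* dest = 1, tag = 99   */
        } else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                            /* source = 0, tag = 99 */
            printf("rank 1 received %d\n", x);
        }

        MPI_Finalize();
        return 0;
    }

Run with something like mpirun -np 2 ./a.out; the blocking send/recv pair is the “pairwise synch event” of the slide, and the copying, buffering, and protection overheads listed above all hide inside these two calls.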

MIT J-Machine (Jelly-bean machine)
• 3-dimensional network topology
  – Non-adaptive, E-cube routing
  – Hardware routing
  – Maximize density of communication
• 64-nodes/board, 1024 nodes total
• Low-powered processors
  – Message passing instructions
  – Associative array primitives to aid in synthesizing shared-address space
• Extremely fine-grained communication
  – Hardware-supported Active Messages

Diminishing Role of Topology?
• Shift to general links
  – DMA, enabling non-blocking ops
    » Buffered by system at destination until recv
  – Store&forward routing
• Fault-tolerant, multi-path routing:
  Intel iPSC/1 -> iPSC/2 -> iPSC/860
• Diminishing role of topology
  – Any-to-any pipelined routing
  – node-network interface dominates communication time
    » Network fast relative to overhead
    » Will this change for ManyCore?
  – Simplifies programming
  – Allows richer design space
    » grids vs hypercubes

Example: Intel Paragon
[Figure: Intel Paragon node: two i860 processors with L1 $ on a 64-bit, 50 MHz memory bus, DMA-driven NI, and 4-way interleaved DRAM; nodes attached to every switch of a 2D grid network (8 bits, 175 MHz, bidirectional); Sandia’s Intel Paragon XP/S]

Building on the mainstream: IBM SP-2
• Made out of essentially complete RS6000 workstations
• Network interface integrated in I/O bus (bw limited by I/O bus)
[Figure: IBM SP-2 node: Power 2 CPU with L2 $, memory bus, memory controller, 4-way interleaved DRAM, MicroChannel I/O bus, and a NIC with i860, NI DRAM, and DMA; general interconnection network formed from 8-port switches]

Berkeley NOW
• 100 Sun Ultra2 workstations
• Intelligent network interface
  – proc + mem
• Myrinet Network
  – 160 MB/s per link
  – 300 ns per hop

Data Parallel Systems
• Programming model
  – Operations performed in parallel on each element of data structure
  – Logically single thread of control, performs sequential or parallel steps
  – Conceptually, a processor associated with each data element
• Architectural model
  – Array of many simple, cheap processors with little memory each
    » Processors don’t sequence through instructions
  – Attached to a control processor that issues instructions
  – Specialized and general communication, cheap global synchronization
• Original motivations
  – Matches simple differential equation solvers
  – Centralize high cost of instruction fetch/sequencing
[Figure: a control processor broadcasting instructions to a 2-D grid of PEs]

Application of Connection Machine

– Each PE contains an employee record with his/her salary
    If salary > 100K then salary = salary * 1.05
    else salary = salary * 1.10
– Logically, the whole operation is a single step
– Some processors enabled for arithmetic operation, others disabled
• Other examples:
  – Finite differences, linear algebra, ...
  – Document searching, graphics, image processing, ...
• Some recent machines:
  – Thinking Machines CM-1, CM-2 (and CM-5)
  – Maspar MP-1 and MP-2

(Tucker, IEEE Computer, Aug. 1988)
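The salary update above, written as a data-parallel loop in C with an OpenMP pragma (a sketch for comparison only; a CM-2 would instead execute it as one wide SIMD step with some PEs masked off by the condition, and the array size and contents here are made up):

    #include <stdio.h>

    #define N_EMPLOYEES 1024

    int main(void) {
        static double salary[N_EMPLOYEES];
        for (int i = 0; i < N_EMPLOYEES; i++) salary[i] = 50000.0 + 100.0 * i;

        /* Logically one step over all elements; the if/else plays the role of
           enabling some processing elements and disabling the others. */
        #pragma omp parallel for
        for (int i = 0; i < N_EMPLOYEES; i++) {
            if (salary[i] > 100000.0) salary[i] *= 1.05;
            else                      salary[i] *= 1.10;
        }

        printf("salary[0] = %.2f, salary[%d] = %.2f\n",
               salary[0], N_EMPLOYEES - 1, salary[N_EMPLOYEES - 1]);
        return 0;
    }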

NVidia Tesla Architecture
• Combined GPU and general CPU

Components of NVidia Tesla architecture
• SM has 8 SP thread processor cores
  – 32 GFLOPS peak at 1.35 GHz
  – IEEE 754 32-bit floating point
  – 32-bit, 64-bit integer
  – 2 SFU special function units
• Scalar ISA
  – Memory load/store/atomic
  – Texture fetch
  – Branch, call, return
  – Barrier synchronization instruction
• Multithreaded Instruction Unit
  – 768 independent threads per SM
  – HW multithreading & scheduling
• 16KB Shared Memory
  – Concurrent threads share data
  – Low latency load/store
• Full GPU
  – Total performance > 500 GOps


Evolution and Convergence
• SIMD popular when cost savings of centralized sequencer high
  – 60s when CPU was a cabinet
  – Replaced by vectors in mid-70s
    » More flexible w.r.t. memory layout and easier to manage
  – Revived in mid-80s when 32-bit datapath slices just fit on chip
• Simple, regular applications have good locality
• Programming model converges with SPMD (single program multiple data)
  – need fast global synchronization
  – Structured global address space, implemented with either SAS or MP

CM-5
• Repackaged SparcStation
  – 4 per board
• Fat-Tree network
• Control network for global synchronization

1/28/08 Kubiatowicz CS258 ©UCB Spring 2008 Lec 2.35 1/28/08 Kubiatowicz CS258 ©UCB Spring 2008 Lec 2.36 Dataflow Architectures • Represent computation as a graph of essential dependences – Logical processor at each node, activated by availability of operands – Message (tokens) carrying tag of next Systolic instruction sent to next processor SIMD Arrays Generic – Tag compared with others in matching

Architecture store;1b match fires ceexecution Message Passing

Dataflow a = (b +1) × (b − c) + − × Shared Memory d = c × e f = a × d d × Dataflow graph a × Network f

Token Program store store

Network Waiting Instruction Execute Form Monsoon (MIT) Matching fetch token

Token queue 1/28/08 Kubiatowicz CS258 ©UCB Spring 2008 Lec 2.37 1/28/08 Kubiatowicz CS258Network ©UCB Spring 2008 Lec 2.38
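The example written out in C purely to make the dependences visible (a sketch, not how a dataflow machine executes; the initial values are arbitrary). Nodes whose operands are all available may fire concurrently:

    #include <stdio.h>

    int main(void) {
        double b = 3.0, c = 1.0, e = 2.0;

        double t1 = b + 1.0;   /* independent: may fire in parallel */
        double t2 = b - c;     /* independent: may fire in parallel */
        double d  = c * e;     /* independent: may fire in parallel */

        double a = t1 * t2;    /* fires once the t1 and t2 tokens arrive */
        double f = a * d;      /* fires once the a and d tokens arrive   */

        printf("a = %g, d = %g, f = %g\n", a, d, f);
        return 0;
    }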

Evolution and Convergence
• Key characteristics
  – Ability to name operations, synchronization, dynamic scheduling
• Problems
  – Operations have locality across them, useful to group together
  – Handling complex data structures like arrays
  – Complexity of matching store and memory units
  – Expose too much parallelism (?)
• Converged to use conventional processors and memory
  – Support for large, dynamic set of threads to map to processors
  – Typically shared address space as well
  – But separation of progr. model from hardware (like data-parallel)
• Lasting contributions:
  – Integration of communication with thread (handler) generation
  – Tightly integrated communication and fine-grained synchronization
  – Remained useful concept for software (compilers etc.)

Systolic Architectures
• VLSI enables inexpensive special-purpose chips
  – Represent algorithms directly by chips connected in regular pattern
  – Replace single processor with array of regular processing elements
  – Orchestrate data flow for high throughput with less memory access
[Figure: a single PE between memories (M → PE → M) vs. a regular array of PEs between memories (M → PE → PE → PE → M)]
• Different from pipelining
  – Nonlinear array structure, multidirection data flow, each PE may have (small) local instruction and data memory
• SIMD? : each PE may do something different

Systolic Arrays (contd.)
• Example: systolic array for 1-D convolution (C sketch below)
    y(i) = w1 × x(i) + w2 × x(i+1) + w3 × x(i+2) + w4 × x(i+3)
[Figure: x values (x1..x8) stream through cells holding weights w4, w3, w2, w1; each cell computes xout = x; x = xin; yout = yin + w × xin; results y1, y2, y3 emerge]
– Practical realizations (e.g. iWARP) use quite general processors
  » Enable variety of algorithms on same hardware
– But dedicated interconnect channels
  » Data transfer directly from register to register across channel
– Specialized, and same problems as SIMD
  » General purpose systems work well for same algorithms (locality etc.)

Toward Architectural Convergence
• Evolution and role of software have blurred boundary
  – Send/recv supported on SAS machines via buffers
  – Can construct global address space on MP (GA -> P | LA)
  – Page-based (or finer-grained) shared virtual memory
• Hardware organization converging too
  – Tighter NI integration even for MP (low-latency, high-bandwidth)
  – Hardware SAS passes messages
• Even clusters of workstations/SMPs are parallel systems
  – Emergence of fast system area networks (SAN)
• Programming models distinct, but organizations converging
  – Nodes connected by general network and communication assists
  – Implementations also converging, at least in high-end machines
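The 1-D convolution from the “Systolic Arrays (contd.)” slide above, spelled out in plain C to pin down what each cell computes (a sketch; a real systolic array streams the x values through the cells instead of indexing memory, and the data values here are made up):

    #include <stdio.h>

    #define NX 8                 /* input samples x(1)..x(8) */
    #define NW 4                 /* weights w1..w4           */
    #define NY (NX - NW + 1)

    int main(void) {
        double x[NX] = {1, 2, 3, 4, 5, 6, 7, 8};
        double w[NW] = {0.25, 0.25, 0.25, 0.25};
        double y[NY];

        /* y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3) */
        for (int i = 0; i < NY; i++) {
            y[i] = 0.0;
            for (int j = 0; j < NW; j++)
                y[i] += w[j] * x[i + j];   /* one cell's yout = yin + w * xin step */
        }

        for (int i = 0; i < NY; i++)
            printf("y[%d] = %g\n", i, y[i]);
        return 0;
    }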


Convergence: Generic Parallel Architecture
[Figure: nodes, each with processor P, $, Mem, and a communication assist (CA), connected by a scalable network]
• Node: processor(s), memory system, plus communication assist
  – Network interface and communication controller
• Scalable network
• Convergence allows lots of innovation, within framework
  – Integration of assist with node, what operations, how efficiently...

Flynn’s Taxonomy
• # instruction streams x # data streams
  – Single Instruction Single Data (SISD)
  – Single Instruction Multiple Data (SIMD)
  – Multiple Instruction Single Data (MISD)
  – Multiple Instruction Multiple Data (MIMD)
• Everything is MIMD!
• However
  – Question is one of efficiency
  – How easily (and at what power!) can you do certain operations?
  – GPU solution from NVIDIA good at graphics – is it good in general?
• As (More?) Important: communication architecture
  – How do processors communicate with one another
  – How does the programmer build correct programs?

Any hope for us to do research in multiprocessing?
• Yes: FPGAs as New Research Platform
• As ~25 CPUs can fit in a Field Programmable Gate Array (FPGA), 1000-CPU system from ~40 FPGAs?
• 64-bit simple “soft core” RISC at 100MHz in 2004 (Virtex-II)
• FPGA generations every 1.5 yrs; 2X CPUs, 2X clock rate
• HW research community does logic design (“gate shareware”) to create out-of-the-box processor that runs standard binaries of OS, apps
  – Gateware: Processors, Caches, Coherency, Ethernet Interfaces, Switches, Routers, … (IBM, Sun have donated processors)
  – E.g., 1000 processor, IBM Power binary-compatible, cache-coherent supercomputer @ 200 MHz; fast enough for research

RAMP
• Since goal is to ramp up research in multiprocessing, called Research Accelerator for Multiple Processors
  – To learn more, read “RAMP: Research Accelerator for Multiple Processors - A Community Vision for a Shared Experimental Parallel HW/SW Platform,” Technical Report UCB//CSD-05-1412, Sept 2005
  – Web page ramp.eecs.berkeley.edu
• Project Opportunities?
  – Many
  – Infrastructure development for research
  – Validation against simulators/real systems
  – Development of new communication features
  – Etc….

Why RAMP Good for Research?

                                  SMP                    Cluster                Simulate                RAMP
  Cost (1000 CPUs)                F ($40M)               C ($2M)                A+ ($0M)                A ($0.1M)
  Cost of ownership               A                      D                      A                       A
  Scalability (1k CPUs)           C                      A                      A                       A
  Power/Space (kilowatts, racks)  D (120 kw, 12 racks)   D (120 kw, 12 racks)   A+ (.1 kw, 0.1 racks)   A (1.5 kw, 0.3 racks)
  Community                       D                      A                      A                       A
  Observability                   D                      C                      A+                      A+
  Reproducibility                 B                      D                      A+                      A+
  Flexibility                     D                      C                      A+                      A+
  Credibility                     A+                     A+                     F                       A
  Perform. (clock)                A (2 GHz)              A (3 GHz)              F (0 GHz)               C (0.2 GHz)
  GPA                             C                      B-                     B                       A-

RAMP 1 Hardware
• Completed Dec. 2004 (14x17 inch 22-layer PCB)
• Module:
  – FPGAs, memory, 10GigE conn.
  – Compact Flash
  – Administration/maintenance ports:
    » 10/100 Enet
    » HDMI/DVI
    » USB
  – ~$4K/module w/o FPGAs or DRAM
• Called “BEE2” for Berkeley Emulation Engine 2

RAMP Blue Prototype (1/07)
• 8 MicroBlaze cores / FPGA
• 8 BEE2 modules (32 “user” FPGAs) x 4 FPGAs/module = 256 cores @ 100MHz
• Full star-connection between RAMP modules
• It works; runs NAS benchmarks
• CPUs are softcore MicroBlazes (32-bit Xilinx RISC architecture)

Vision: Multiprocessing Watering Hole
[Figure: RAMP at the center of many research communities: parallel file systems, dataflow languages/computers, data center in a box, thread scheduling, security enhancements, Internet in a box, multiprocessor switch design, router design, compile to FPGA, fault insertion to check dependability, parallel languages]
• RAMP attracts many communities to shared artifact
  ⇒ Cross-disciplinary interactions
  ⇒ Accelerate innovation in multiprocessing
• RAMP as next Standard Research Platform? (e.g., VAX/BSD Unix in 1980s, x86/Linux in 1990s)

Conclusion
• Several major types of communication:
  – Shared Memory
  – Message Passing
  – Data-Parallel
  – Systolic
  – DataFlow
• Is communication “Turing-complete”?
  – Can simulate each of these on top of the other!
• Many tradeoffs in hardware support
• Communication is a first-class citizen!
  – How to perform communication is essential
    » IS IT IMPLICIT or EXPLICIT?
  – What to do with communication errors?
  – Does locality matter???
  – How to synchronize?
