
CS 258: Parallel Computer Architecture
Lecture 2 – Convergence of Parallel Architectures
January 28, 2008
Prof. John D. Kubiatowicz
http://www.cs.berkeley.edu/~kubitron/cs258

Review
• Industry has decided that multiprocessing is the future/best use of transistors
  – Every major chip manufacturer now making MultiCore chips
• History of architecture is parallelism
  – translates area and density into performance
• The Future is higher levels of parallelism
  – Parallel Architecture concepts apply at many levels
  – Communication also on exponential curve
• Proper way to compute speedup
  – Incorrect way to measure:
    » Compare parallel program on 1 processor to parallel program on p processors
  – Instead:
    » Should compare uniprocessor program on 1 processor to parallel program on p processors
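As a worked restatement of the speedup point above (the formulas are the standard definitions, not from the original slide, written in the deck's plain notation):

    Speedup_true(p)          = T_best-uniprocessor-program / T_parallel(p)
    Speedup_self-relative(p) = T_parallel(1) / T_parallel(p)

The second ratio flatters the parallel code, since T_parallel(1) carries parallelization overheads that the best sequential program does not; the slide's "incorrect way to measure" is exactly this self-relative speedup.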


History
• Parallel architectures tied closely to programming models
  – Divergent architectures, with no predictable pattern of growth
  – Mid 80s renaissance

Plan for Today
• Look at major programming models
  – where did they come from?
  – The 80s architectural renaissance!
  – What do they provide?
  – How have they converged?
• Extract general structure and fundamental issues

[Figure: Application Software and System Software sitting above a Generic Architecture into which Systolic Arrays, SIMD, Message Passing, Dataflow, and Shared Memory architectures converge]

Programming Model
• Conceptualization of the machine that programmer uses in coding applications
  – How parts cooperate and coordinate their activities
  – Specifies communication and synchronization operations
• Multiprogramming
  – no communication or synch. at program level
• Shared address space
  – like bulletin board
• Message passing
  – like letters or phone calls, explicit point to point
• Data parallel:
  – more regimented, global actions on data
  – Implemented with shared address space or message passing

Shared Memory ⇒ Shared Addr. Space
[Figure: several processors all connected to one shared memory]
• Range of addresses shared by all processors
  – All communication is implicit (through memory)
  – Want to communicate a bunch of info? Pass pointer.
• Programming is “straightforward”
  – Generalization of multithreaded programming
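A minimal sketch of the “pass a pointer” idea in C with POSIX threads (an illustration added here, not from the slides; the buffer name and sizes are made up). One thread produces data in memory and another reads it through the same address; pthread_join supplies the ordering:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000
    static double *shared_buf;     /* same address names the same data in every thread */

    static void *producer(void *arg) {
        shared_buf = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) shared_buf[i] = 0.5 * i;
        return NULL;               /* nothing is "sent": the data is already shared */
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        pthread_join(t, NULL);     /* join orders producer's writes before the read below */
        printf("shared_buf[42] = %f\n", shared_buf[42]);   /* communication was implicit */
        free(shared_buf);
        return 0;
    }

Compile with something like cc -pthread; the point is that no explicit send/receive appears anywhere: ordinary loads and stores do the communication.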

Historical Development
• “Mainframe” approach
  – Motivated by multiprogramming
  – Extends crossbar used for Mem and I/O
  – Processor cost-limited => crossbar
  – Bandwidth scales with p
  – High incremental cost
    » use multistage instead
• “Minicomputer” approach
  – Almost all microprocessor systems have bus
  – Motivated by multiprogramming, TP
  – Used heavily for I/O
  – Called symmetric multiprocessor (SMP)
  – Latency larger than for uniprocessor
  – Bus is bandwidth bottleneck
    » caching is key: coherence problem
  – Low incremental cost
[Figure: crossbar and multistage interconnects joining P's, Mem's, and I/O ctrl's; bus-based SMP with a cache ($) in front of each P]

Adding Processing Capacity
[Figure: processors, memory modules, and I/O controllers/devices attached to an interconnect]
• Memory capacity increased by adding modules
• I/O by controllers and devices
• Add processors for processing!
  – For higher-throughput multiprogramming, or parallel programs

Shared Physical Memory
• Any processor can directly reference any location
  – Communication operation is load/store
  – Special operations for synchronization
• Any I/O controller - any memory
• Operating system can run on any processor, or all
  – OS uses shared memory to coordinate
• What about application processes?

Shared Virtual Address Space
• Process = address space plus thread(s) of control
• Virtual-to-physical mapping can be established so that processes share portions of address space
  – User-kernel or multiple processes
• Multiple threads of control on one address space
  – Popular approach to structuring OS’s
  – Now standard application capability (ex: POSIX threads)
• Writes to shared address visible to other threads
  – Natural extension of uniprocessor model
  – conventional memory operations for communication
  – special atomic operations for synchronization
    » also load/stores
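A small sketch of the split between “conventional memory operations for communication” and “special atomic operations for synchronization,” using POSIX threads and C11 atomics (illustration only; the thread and iteration counts are arbitrary):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NTHREADS 4
    static atomic_long counter;            /* lives in the shared address space */

    static void *worker(void *arg) {
        for (int i = 0; i < 100000; i++)
            atomic_fetch_add(&counter, 1); /* atomic read-modify-write; a plain += would race */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
        printf("counter = %ld (expect %d)\n",
               atomic_load(&counter), NTHREADS * 100000);
        return 0;
    }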


Structured Shared Address Space
[Figure: virtual address spaces of processes P0..Pn, each with a private portion and a shared portion; the shared portions map to common physical addresses in the machine physical address space]
• Ad hoc parallelism used in system code
• Most parallel applications have structured SAS
• Same program on each processor
  – shared variable X means the same thing to each thread

Problem
[Figure: Pn stores a new value into location 4 (write-through?); copies of location 4 sit in several caches ($4), so other processors' loads may miss or read a stale copy]
• Caches are aliases for memory locations
• Does every processor eventually see new value?
• Tightly related: Cache Consistency
  – In what order do writes appear to other processors?
• Buses make this easy: every processor can snoop on every write
  – Essential feature: Broadcast
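A software-level analogue of the visibility and ordering questions above (a sketch added for illustration, not from the slides): one thread writes data and then a flag; whether a reader that sees the flag also sees the data is precisely an ordering question, made explicit here with C11 release/acquire atomics:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int data;                     /* ordinary shared location */
    static atomic_int flag;              /* publication flag */

    static void *writer(void *arg) {
        data = 42;                                              /* write #1 */
        atomic_store_explicit(&flag, 1, memory_order_release);  /* write #2 */
        return NULL;
    }

    static void *reader(void *arg) {
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;                            /* wait until write #2 becomes visible */
        printf("data = %d\n", data);     /* release/acquire guarantees 42 here */
        return NULL;
    }

    int main(void) {
        pthread_t w, r;
        pthread_create(&w, NULL, writer, NULL);
        pthread_create(&r, NULL, reader, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
    }

The hardware questions on the slide (snooping, broadcast) are about making such guarantees cheap to provide.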

Engineering: Pentium Pro Quad
[Figure: four P-Pro modules (CPU, interrupt controller, 256-KB L2 $, bus interface) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), with memory controller, MIU, 1-, 2-, or 4-way interleaved DRAM, and two PCI bridges with PCI buses and I/O cards]
– All coherence and multiprocessing glue in processor module
– Highly integrated, targeted at high volume
– Low latency and bandwidth

Engineering: SUN Enterprise
[Figure: CPU/mem cards (two processors, each with $ and $2, plus mem ctrl) and I/O cards (bus interface, SBUS slots, 100bT, SCSI, 2 FiberChannel) on the Gigaplane bus (256 data, 41 address, 83 MHz)]
• Proc + mem card - I/O card
  – 16 cards of either type
  – All memory accessed over bus, so symmetric
  – Higher bandwidth, higher latency bus


Quad-Processor Xeon Architecture
• All sharing through pairs of front side busses (FSB)
  – Memory traffic/cache misses through single chipset to memory
  – Example “Blackford” chipset

Scaling Up
[Figure: “Dance hall” organization (processors with caches on one side of an Omega or general network, memory modules M on the other) vs. distributed memory (each node has P, $, and M on a general network)]
– Problem is interconnect: cost (crossbar) or bandwidth (bus)
– Dance-hall: bandwidth still scalable, but lower cost than crossbar
  » latencies to memory uniform, but uniformly large
– Distributed memory or non-uniform memory access (NUMA)
  » Construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
– Caching shared (particularly nonlocal) data?

Stanford DASH
[Figure: cluster of 4 processors, each with an L1-$, sharing an L2-Cache and memory]
• Clusters of 4 processors share 2nd-level cache
• Up to 16 clusters tied together with 2-dim mesh
• 16-bit directory associated with every memory line
  – Each memory line has home cluster that contains DRAM
  – The 16-bit vector says which clusters (if any) have read copies
  – Only one writer permitted at a time
• Never got more than 12 clusters (48 processors) working at one time: Asynchronous network probs!

The MIT Alewife Multiprocessor
• Cache-coherence
  – Partially in Software!
  – Limited Directory + software overflow
• User-level Message-Passing
• Rapid Context-Switching
• 2-dimensional Asynchronous network
• One node/board
• Got 32 processors (+ I/O boards) working

Engineering: Cray T3E
[Figure: node with P, $, Mem, and a combined mem ctrl and NI, attached to the 3-D interconnect through an XY switch and Z links; external I/O]
– Scale up to 1024 processors, 480MB/s links
– Memory controller generates request message for non-local references
– No hardware mechanism for coherence
  » SGI Origin etc. provide this

AMD Direct Connect
• Communication over general interconnect
  – Shared memory/address space traffic over network
  – I/O traffic to memory over network
  – Multiple topology options (seems to scale to 8 or 16 processor chips)

What is underlying Shared Memory??
[Figure: generic architecture (nodes of P, $, M on a network) as the convergence point of Systolic Arrays, SIMD, Message Passing, Dataflow, and Shared Memory]
• Packet switched networks better utilize available link bandwidth than circuit switched networks
• So, network passes messages around!

Message Passing Architectures
[Figure: complete computers (P, $, M) connected by a network]
• Complete computer as building block, including I/O
  – Communication via explicit I/O operations
• Programming model
  – direct access only to private address space (local memory)
  – communication via explicit messages (send/receive)
• High-level block diagram
  – Communication integration?
    » Mem, I/O, LAN, Cluster
  – Easier to build and scale than SAS
• Programming model more removed from basic hardware operations
  – Library or OS intervention


Message-Passing Abstraction
[Figure: Process P executes Send X, Q, t (buffer at address X, destination Q, tag t); Process Q executes a matching Receive Y, P, t (storage at address Y, source P, tag t); the matched pair copies data between the two local address spaces]
– Send specifies buffer to be transmitted and receiving process
– Recv specifies sending process and application storage to receive into
– Memory to memory copy, but need to name processes
– Optional tag on send and matching rule on receive
– User process names local data and entities in process/tag space too
– In simplest form, the send/recv match achieves pairwise synch event
  » Other variants too
– Many overheads: copying, buffer management, protection

Evolution of Message-Passing Machines
• Early machines: FIFO on each link
  – HW close to prog. model; synchronous ops
  – topology central (hypercube)
[Figure: hypercube with nodes labeled 000–111 (Seitz, CACM Jan 95)]
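The send/recv abstraction above, written concretely in C with MPI (MPI is used here only as a familiar illustration; the slide describes the abstraction generically). Process 0 names the receiver and a tag; process 1 names the sender and the same tag, which is the matching rule:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, x = 0;
        const int TAG = 99;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            x = 42;
            MPI_Send(&x, 1, MPI_INT, 1, TAG, MPI_COMM_WORLD);      /* dest = 1, tag = 99   */
        } else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                            /* source = 0, tag = 99 */
            printf("rank 1 received %d\n", x);
        }

        MPI_Finalize();
        return 0;
    }

Run with something like mpirun -np 2 ./a.out; the blocking send/recv pair is the “pairwise synch event” of the slide, and the copying, buffering, and protection overheads listed above all hide inside these two calls.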

MIT J-Machine (Jelly-bean machine)
• 3-dimensional network topology
  – Non-adaptive, E-cube routing
  – Hardware routing
  – Maximize density of communication
• 64-nodes/board, 1024 nodes total
• Low-powered processors
  – Message passing instructions
  – Associative array primitives to aid in synthesizing shared-address space
• Extremely fine-grained communication
  – Hardware-supported Active Messages

Diminishing Role of Topology?
• Shift to general links
  – DMA, enabling non-blocking ops
    » Buffered by system at destination until recv
  – Store&forward routing
• Fault-tolerant, multi-path routing:
  Intel iPSC/1 -> iPSC/2 -> iPSC/860
• Diminishing role of topology
  – Any-to-any pipelined routing
  – node-network interface dominates communication time
    » Network fast relative to overhead
    » Will this change for ManyCore?
  – Simplifies programming
  – Allows richer design space
    » grids vs hypercubes

Example: Intel Paragon
[Figure: Intel Paragon node: two i860 processors with L1 $ on a 64-bit, 50 MHz memory bus, DMA-driven NI, and 4-way interleaved DRAM; nodes attached to every switch of a 2D grid network (8 bits, 175 MHz, bidirectional); Sandia’s Intel Paragon XP/S]

Building on the mainstream: IBM SP-2
• Made out of essentially complete RS6000 workstations
• Network interface integrated in I/O bus (bw limited by I/O bus)
[Figure: IBM SP-2 node: Power 2 CPU with L2 $, memory bus, memory controller, 4-way interleaved DRAM, MicroChannel I/O bus, and a NIC with i860, NI DRAM, and DMA; general interconnection network formed from 8-port switches]

Berkeley NOW
• 100 Sun Ultra2 workstations
• Intelligent network interface
  – proc + mem
• Myrinet Network
  – 160 MB/s per link
  – 300 ns per hop

Data Parallel Systems
• Programming model
  – Operations performed in parallel on each element of data structure
  – Logically single thread of control, performs sequential or parallel steps
  – Conceptually, a processor associated with each data element
• Architectural model
  – Array of many simple, cheap processors with little memory each
    » Processors don’t sequence through instructions
  – Attached to a control processor that issues instructions
  – Specialized and general communication, cheap global synchronization
• Original motivations
  – Matches simple differential equation solvers
  – Centralize high cost of instruction fetch/sequencing
[Figure: a control processor broadcasting instructions to a 2-D grid of PEs]

Application of Connection Machine

– Each PE contains an employee record with his/her salary
    If salary > 100K then salary = salary * 1.05
    else salary = salary * 1.10
– Logically, the whole operation is a single step
– Some processors enabled for arithmetic operation, others disabled
• Other examples:
  – Finite differences, linear algebra, ...
  – Document searching, graphics, image processing, ...
• Some recent machines:
  – Thinking Machines CM-1, CM-2 (and CM-5)
  – Maspar MP-1 and MP-2

(Tucker, IEEE Computer, Aug. 1988)
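The salary update above, written as a data-parallel loop in C with an OpenMP pragma (a sketch for comparison only; a CM-2 would instead execute it as one wide SIMD step with some PEs masked off by the condition, and the array size and contents here are made up):

    #include <stdio.h>

    #define N_EMPLOYEES 1024

    int main(void) {
        static double salary[N_EMPLOYEES];
        for (int i = 0; i < N_EMPLOYEES; i++) salary[i] = 50000.0 + 100.0 * i;

        /* Logically one step over all elements; the if/else plays the role of
           enabling some processing elements and disabling the others. */
        #pragma omp parallel for
        for (int i = 0; i < N_EMPLOYEES; i++) {
            if (salary[i] > 100000.0) salary[i] *= 1.05;
            else                      salary[i] *= 1.10;
        }

        printf("salary[0] = %.2f, salary[%d] = %.2f\n",
               salary[0], N_EMPLOYEES - 1, salary[N_EMPLOYEES - 1]);
        return 0;
    }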

NVidia Tesla Architecture
• Combined GPU and general CPU

Components of NVidia Tesla architecture
• SM has 8 SP thread processor cores
  – 32 GFLOPS peak at 1.35 GHz
  – IEEE 754 32-bit floating point
  – 32-bit, 64-bit integer
  – 2 SFU special function units
• Scalar ISA
  – Memory load/store/atomic
  – Texture fetch
  – Branch, call, return
  – Barrier synchronization instruction
• Multithreaded Instruction Unit
  – 768 independent threads per SM
  – HW multithreading & scheduling
• 16KB Shared Memory
  – Concurrent threads share data
  – Low latency load/store
• Full GPU
  – Total performance > 500 GOps


Evolution and Convergence
• SIMD popular when cost savings of centralized sequencer high
  – 60s when CPU was a cabinet
  – Replaced by vectors in mid-70s
    » More flexible w.r.t. memory layout and easier to manage
  – Revived in mid-80s when 32-bit datapath slices just fit on chip
• Simple, regular applications have good locality
• Programming model converges with SPMD (single program multiple data)
  – need fast global synchronization
  – Structured global address space, implemented with either SAS or MP

CM-5
• Repackaged SparcStation
  – 4 per board
• Fat-Tree network
• Control network for global synchronization

1/28/08 Kubiatowicz CS258 ©UCB Spring 2008 Lec 2.35 1/28/08 Kubiatowicz CS258 ©UCB Spring 2008 Lec 2.36 Dataflow Architectures • Represent computation as a graph of essential dependences – Logical processor at each node, activated by availability of operands – Message (tokens) carrying tag of next Systolic instruction sent to next processor SIMD Arrays Generic – Tag compared with others in matching

Architecture store;1b match fires ceexecution Message Passing

Dataflow a = (b +1) × (b − c) + − × Shared Memory d = c × e f = a × d d × Dataflow graph a × Network f

Token Program store store

Network Waiting Instruction Execute Form Monsoon (MIT) Matching fetch token

Token queue 1/28/08 Kubiatowicz CS258 ©UCB Spring 2008 Lec 2.37 1/28/08 Kubiatowicz CS258Network ©UCB Spring 2008 Lec 2.38
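The example written out in C purely to make the dependences visible (a sketch, not how a dataflow machine executes; the initial values are arbitrary). Nodes whose operands are all available may fire concurrently:

    #include <stdio.h>

    int main(void) {
        double b = 3.0, c = 1.0, e = 2.0;

        double t1 = b + 1.0;   /* independent: may fire in parallel */
        double t2 = b - c;     /* independent: may fire in parallel */
        double d  = c * e;     /* independent: may fire in parallel */

        double a = t1 * t2;    /* fires once the t1 and t2 tokens arrive */
        double f = a * d;      /* fires once the a and d tokens arrive   */

        printf("a = %g, d = %g, f = %g\n", a, d, f);
        return 0;
    }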

Evolution and Convergence
• Key characteristics
  – Ability to name operations, synchronization, dynamic scheduling
• Problems
  – Operations have locality across them, useful to group together
  – Handling complex data structures like arrays
  – Complexity of matching store and memory units
  – Expose too much parallelism (?)
• Converged to use conventional processors and memory
  – Support for large, dynamic set of threads to map to processors
  – Typically shared address space as well
  – But separation of progr. model from hardware (like data-parallel)
• Lasting contributions:
  – Integration of communication with thread (handler) generation
  – Tightly integrated communication and fine-grained synchronization
  – Remained useful concept for software (compilers etc.)

Systolic Architectures
• VLSI enables inexpensive special-purpose chips
  – Represent algorithms directly by chips connected in regular pattern
  – Replace single processor with array of regular processing elements
  – Orchestrate data flow for high throughput with less memory access
[Figure: a single PE between memories (M → PE → M) vs. a regular array of PEs between memories (M → PE → PE → PE → M)]
• Different from pipelining
  – Nonlinear array structure, multidirection data flow, each PE may have (small) local instruction and data memory
• SIMD? : each PE may do something different

Systolic Arrays (contd.)
• Example: systolic array for 1-D convolution (C sketch below)
    y(i) = w1 × x(i) + w2 × x(i+1) + w3 × x(i+2) + w4 × x(i+3)
[Figure: x values (x1..x8) stream through cells holding weights w4, w3, w2, w1; each cell computes xout = x; x = xin; yout = yin + w × xin; results y1, y2, y3 emerge]
– Practical realizations (e.g. iWARP) use quite general processors
  » Enable variety of algorithms on same hardware
– But dedicated interconnect channels
  » Data transfer directly from register to register across channel
– Specialized, and same problems as SIMD
  » General purpose systems work well for same algorithms (locality etc.)

Toward Architectural Convergence
• Evolution and role of software have blurred boundary
  – Send/recv supported on SAS machines via buffers
  – Can construct global address space on MP (GA -> P | LA)
  – Page-based (or finer-grained) shared virtual memory
• Hardware organization converging too
  – Tighter NI integration even for MP (low-latency, high-bandwidth)
  – Hardware SAS passes messages
• Even clusters of workstations/SMPs are parallel systems
  – Emergence of fast system area networks (SAN)
• Programming models distinct, but organizations converging
  – Nodes connected by general network and communication assists
  – Implementations also converging, at least in high-end machines
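The 1-D convolution from the “Systolic Arrays (contd.)” slide above, spelled out in plain C to pin down what each cell computes (a sketch; a real systolic array streams the x values through the cells instead of indexing memory, and the data values here are made up):

    #include <stdio.h>

    #define NX 8                 /* input samples x(1)..x(8) */
    #define NW 4                 /* weights w1..w4           */
    #define NY (NX - NW + 1)

    int main(void) {
        double x[NX] = {1, 2, 3, 4, 5, 6, 7, 8};
        double w[NW] = {0.25, 0.25, 0.25, 0.25};
        double y[NY];

        /* y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3) */
        for (int i = 0; i < NY; i++) {
            y[i] = 0.0;
            for (int j = 0; j < NW; j++)
                y[i] += w[j] * x[i + j];   /* one cell's yout = yin + w * xin step */
        }

        for (int i = 0; i < NY; i++)
            printf("y[%d] = %g\n", i, y[i]);
        return 0;
    }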


Convergence: Generic Parallel Architecture
[Figure: nodes, each with processor P, $, Mem, and a communication assist (CA), connected by a scalable network]
• Node: processor(s), memory system, plus communication assist
  – Network interface and communication controller
• Scalable network
• Convergence allows lots of innovation, within framework
  – Integration of assist with node, what operations, how efficiently...

Flynn’s Taxonomy
• # instruction streams x # data streams
  – Single Instruction Single Data (SISD)
  – Single Instruction Multiple Data (SIMD)
  – Multiple Instruction Single Data (MISD)
  – Multiple Instruction Multiple Data (MIMD)
• Everything is MIMD!
• However
  – Question is one of efficiency
  – How easily (and at what power!) can you do certain operations?
  – GPU solution from NVIDIA good at graphics – is it good in general?
• As (More?) Important: communication architecture
  – How do processors communicate with one another
  – How does the programmer build correct programs?

Any hope for us to do research in multiprocessing?
• Yes: FPGAs as New Research Platform
• As ~25 CPUs can fit in a Field Programmable Gate Array (FPGA), 1000-CPU system from ~40 FPGAs?
• 64-bit simple “soft core” RISC at 100MHz in 2004 (Virtex-II)
• FPGA generations every 1.5 yrs; 2X CPUs, 2X clock rate
• HW research community does logic design (“gate shareware”) to create out-of-the-box processor that runs standard binaries of OS, apps
  – Gateware: Processors, Caches, Coherency, Ethernet Interfaces, Switches, Routers, … (IBM, Sun have donated processors)
  – E.g., 1000 processor, IBM Power binary-compatible, cache-coherent supercomputer @ 200 MHz; fast enough for research

RAMP
• Since goal is to ramp up research in multiprocessing, called Research Accelerator for Multiple Processors
  – To learn more, read “RAMP: Research Accelerator for Multiple Processors - A Community Vision for a Shared Experimental Parallel HW/SW Platform,” Technical Report UCB//CSD-05-1412, Sept 2005
  – Web page ramp.eecs.berkeley.edu
• Project Opportunities?
  – Many
  – Infrastructure development for research
  – Validation against simulators/real systems
  – Development of new communication features
  – Etc….

Why RAMP Good for Research?

                                  SMP                    Cluster                Simulate                RAMP
  Cost (1000 CPUs)                F ($40M)               C ($2M)                A+ ($0M)                A ($0.1M)
  Cost of ownership               A                      D                      A                       A
  Scalability (1k CPUs)           C                      A                      A                       A
  Power/Space (kilowatts, racks)  D (120 kw, 12 racks)   D (120 kw, 12 racks)   A+ (.1 kw, 0.1 racks)   A (1.5 kw, 0.3 racks)
  Community                       D                      A                      A                       A
  Observability                   D                      C                      A+                      A+
  Reproducibility                 B                      D                      A+                      A+
  Flexibility                     D                      C                      A+                      A+
  Credibility                     A+                     A+                     F                       A
  Perform. (clock)                A (2 GHz)              A (3 GHz)              F (0 GHz)               C (0.2 GHz)
  GPA                             C                      B-                     B                       A-

RAMP 1 Hardware
• Completed Dec. 2004 (14x17 inch 22-layer PCB)
• Module:
  – FPGAs, memory, 10GigE conn.
  – Compact Flash
  – Administration/maintenance ports:
    » 10/100 Enet
    » HDMI/DVI
    » USB
  – ~$4K/module w/o FPGAs or DRAM
• Called “BEE2” for Berkeley Emulation Engine 2

RAMP Blue Prototype (1/07)
• 8 MicroBlaze cores / FPGA
• 8 BEE2 modules (32 “user” FPGAs) x 4 FPGAs/module = 256 cores @ 100MHz
• Full star-connection between RAMP modules
• It works; runs NAS benchmarks
• CPUs are softcore MicroBlazes (32-bit Xilinx RISC architecture)

Vision: Multiprocessing Watering Hole
[Figure: RAMP at the center of many research communities: parallel file systems, dataflow languages/computers, data center in a box, thread scheduling, security enhancements, Internet in a box, multiprocessor switch design, router design, compile to FPGA, fault insertion to check dependability, parallel languages]
• RAMP attracts many communities to shared artifact
  ⇒ Cross-disciplinary interactions
  ⇒ Accelerate innovation in multiprocessing
• RAMP as next Standard Research Platform? (e.g., VAX/BSD Unix in 1980s, x86/Linux in 1990s)

Conclusion
• Several major types of communication:
  – Shared Memory
  – Message Passing
  – Data-Parallel
  – Systolic
  – DataFlow
• Is communication “Turing-complete”?
  – Can simulate each of these on top of the other!
• Many tradeoffs in hardware support
• Communication is a first-class citizen!
  – How to perform communication is essential
    » IS IT IMPLICIT or EXPLICIT?
  – What to do with communication errors?
  – Does locality matter???
  – How to synchronize?
