CSE4351/5351 Parallel Processing

Instructor: Dr. Song Jiang, The CSE Department, [email protected], http://ranger.uta.edu/~sjiang/CSE4351-5351-summer/index.htm

Lecture: MoTuWeTh 10:30AM - 12:30PM LS 101

Office hours: Monday 2-3pm at ERB 101

1 Outline ▪ Introduction ➢What is parallel computing? ➢Why should you care? ▪ Course administration ➢Course coverage ➢Workload and grading ▪ Inevitability of parallel computing ➢Application demands ➢Technology and architecture trends ➢Economics ▪ Convergence of parallel architecture ➢ Shared address space, message passing, data parallel, data flow ➢ A generic parallel architecture

2 What is a Parallel Computer?

“A parallel computer is a collection of processing elements that can communicate and cooperate to solve large problems fast” ------Almasi/Gottlieb

▪ “communicate and cooperate” ➢Node and interconnect architecture ➢Problem partitioning and orchestration ▪ “large problems fast” ➢Programming model ➢Match of model and architecture ▪ Focus of this course ➢Parallel architecture ➢Parallel programming models ➢Interaction between models and architecture

3 What is a Parallel Computer? (cont’d)

Some broad issues: • Resource Allocation: – how large a collection? – how powerful are the elements? • Data access, Communication and Synchronization – how are data transmitted between processors? – how do the elements cooperate and communicate? – what are the abstractions and primitives for cooperation? • Performance and Scalability – how does it all translate into performance? – how does it scale?

4 Why Study Parallel Computing

▪ Inevitability of parallel computing ➢ Fueled by application demand for performance • Scientific: weather forecasting, pharmaceutical design, and genomics • Commercial: OLTP, search engine, decision support, data mining • Scalable web servers ➢ Enabled by technology and architecture trends • limits to sequential CPU, memory, storage performance

o parallelism is an effective way of utilizing the growing number of transistors. • low incremental cost of supporting parallelism

▪ Convergence of parallel computer organizations ➢ driven by technology constraints and economies of scale • laptops and supercomputers share the same building block ➢ growing consensus on fundamental principles and design tradeoffs

5 Why Study Parallel Computing (cont’d) • Parallel computing is ubiquitous: ➢ Multithreading ➢ Simultaneous multithreading (SMT), a.k.a. hyper-threading • e.g., Intel® Pentium 4 Xeon ➢Chip Multiprocessor (CMP), a.k.a. multi-core processor • Intel® Core™ Duo, Xbox 360 (triple cores, each with SMT), AMD quad-core • IBM Cell processor with as many as 9 cores, used in the Sony PlayStation 3, Toshiba HDTV sets, and the IBM Roadrunner HPC system ➢ Symmetric Multiprocessor (SMP), a.k.a. shared memory multiprocessor • e.g., Intel® Pentium Pro Quad, motherboards with multiple sockets ➢ Cluster-based • IBM Bluegene/L (65,536 modified PowerPC 440 chips, each with two cores) • IBM Roadrunner (6,562 dual-core AMD Opteron® chips and 12,240 Cell chips)

6 Course Coverage

• Parallel architectures Q: which are the dominant architectures? A: small-scale shared memory (SMPs), large-scale distributed memory • Programming model Q: how to program these architectures? A: Message passing and shared memory models • Programming for performance Q: how are programming models mapped to the underlying architecture, and how can this mapping be exploited for performance?

7 Course Administration

• Course prerequisites • Course textbooks • Class attendance • Required work and grading policy • Late policy • Academic honesty

(see details on the syllabus)

8 Outline ▪ Introduction ➢What is parallel computing? ➢Why should you care? ▪ Course administration ➢Course coverage ➢Workload and grading ▪ Inevitability of parallel computing ➢Application demands ➢Technology and architecture trends ➢Economics ▪ Convergence of parallel architecture ➢Shared address space, message passing, data parallel, data flow, systolic ➢A generic parallel architecture

9 Inevitability of Parallel Computing • Application demands: ➢ Our insatiable need for computing cycles in challenging applications • Technology Trends ➢Number of transistors on chip growing rapidly ➢Clock rates expected to go up only slowly • Architecture Trends ➢Instruction-level parallelism valuable but limited ➢Coarser-level parallelism, as in MPs, is the most viable approach • Economics ➢Low incremental cost of supporting parallelism

10 Application Demands: Scientific Computing • Large parallel machines are a mainstay in many industries ➢Petroleum • Reservoir analysis

➢Automotive • Crash simulation, combustion efficiency ➢Aeronautics • Airflow analysis, structural mechanics, electromagnetism ➢Computer-aided design ➢Pharmaceuticals • Molecular modeling ➢Visualization • Entertainment (rendering: 2,300 CPU years on 2.8 GHz Intel Xeons, at a rate of approximately one hour per frame) • Architecture ➢Financial modeling • Yield and derivative analysis

11 Simulation: The Third Pillar of Science

Traditional scientific and engineering paradigm: 1) Do theory or paper design. 2) Perform experiments or build system. Limitations: – Too difficult -- build large wind tunnels. – Too expensive -- build a throw-away passenger jet. – Too slow -- wait for climate or galactic evolution. – Too dangerous -- weapons, drug design, climate experimentation. Computational science paradigm: 3) Use high performance computer systems to simulate the phenomenon – Based on known physical laws and efficient numerical methods.

12 Challenge Computation Examples Science • Global climate modeling • Astrophysical modeling • Biology: genomics; protein folding; drug design • Computational chemistry • Computational material sciences and nanosciences Engineering • Crash simulation • Semiconductor design • Earthquake and structural modeling • Computational fluid dynamics (airplane design) Business • Financial and economic modeling Defense • Nuclear weapons -- test by simulation • Cryptography

13 Units of Measure in HPC

High Performance Computing (HPC) units are:
• Flop/s: floating point operations per second
• Bytes: size of data
Typical sizes are millions, billions, trillions…
Mega: Mflop/s = 10^6 flop/sec; Mbyte = 10^6 bytes (also 2^20 = 1,048,576)
Giga: Gflop/s = 10^9 flop/sec; Gbyte = 10^9 bytes (also 2^30 = 1,073,741,824)
Tera: Tflop/s = 10^12 flop/sec; Tbyte = 10^12 bytes (also 2^40 = 1,099,511,627,776)
Peta: Pflop/s = 10^15 flop/sec; Pbyte = 10^15 bytes (also 2^50 = 1,125,899,906,842,624)
Exa: Eflop/s = 10^18 flop/sec; Ebyte = 10^18 bytes

14 Global Climate Modeling Problem Problem is to compute: f(latitude, longitude, elevation, time) → temperature, pressure, humidity, wind velocity Approach: • Discretize the domain, e.g., a measurement point every 1km • Devise an algorithm to predict weather at time t+1 given t

Source: http://www.epm.ornl.gov/chammp/chammp.html

15 Example: Numerical Climate Modeling at NASA

• Weather forecasting over the US landmass: 3000 x 3000 x 11 miles • Assuming 0.1-mile cubic elements ---> 10^11 cells • Assuming a 2-day prediction at 30-minute steps ---> ~100 time steps • Computation: partial differential equations, finite element approach • A single element's computation takes 100 flops • Total number of operations: 10^11 x 100 x 100 = 10^15 (i.e., one peta-flop of work) • Assumed uniprocessor speed: 10^9 flop/s (gigaflops) • It takes 10^6 seconds, or about 280 hours. (Forecast nine days late!) • 1000 processors at 10% efficiency → around 3 hours • IBM Roadrunner → about 1 second ?! • State-of-the-art models require integration of atmosphere, ocean, sea-ice, land models, and more; models demanding more computational resources will be applied.
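The estimate above can be checked with a short back-of-the-envelope program. This is only an illustrative sketch in C; the constants simply restate the slide's assumptions.

#include <stdio.h>

int main(void) {
    double cells          = 1e11;  /* 0.1-mile elements over 3000 x 3000 x 11 miles */
    double steps          = 100;   /* 2-day forecast at 30-minute time steps */
    double flops_per_cell = 100;   /* work per element per step */
    double total_flops    = cells * steps * flops_per_cell;          /* ~1e15 */

    double uni_rate = 1e9;                                            /* 1 Gflop/s uniprocessor */
    printf("uniprocessor: %.0f hours\n", total_flops / uni_rate / 3600.0);

    double par_rate = 1000 * uni_rate * 0.1;                          /* 1000 CPUs at 10% efficiency */
    printf("1000 CPUs at 10%% efficiency: %.1f hours\n", total_flops / par_rate / 3600.0);
    return 0;
}

The program prints roughly 278 hours and 2.8 hours, matching the "about 280 hours" and "around 3 hours" figures above.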

16 High Resolution Climate Modeling on NERSC-3 – P. Duffy, et al., LLNL

17 Commercial Computing

• Parallelism benefits many applications ➢ Database and Web servers for online transaction processing ➢ Decision support ➢ Data mining and data warehousing ➢ Financial modeling • Scale not necessarily as large, but more widely used • Computational power determines the scale of business that can be handled.

18 Outline ▪ Introduction ➢What is parallel computing? ➢Why should you care? ▪ Course administration ➢Course coverage ➢Workload and grading ▪ Inevitability of parallel computing ➢Application demands ➢Technology and architecture trends ➢Economics ▪ Convergence of parallel architecture ➢Shared address space, message passing, data parallel, data flow, systolic ➢A generic parallel architecture

19 Tunnel Vision by Experts

“I think there is a world market for maybe five computers.” – Thomas Watson, chairman of IBM, 1943.

“There is no reason for any individual to have a computer in their home” – Ken Olson, president and founder of Digital Equipment Corporation, 1977.

“640K [of memory] ought to be enough for anybody.” – Bill Gates, chairman of Microsoft, 1981.

20 Technology Trends: Microprocessor Capacity

The number of transistors on a chip doubles every 18 months (while the costs are halved).

21 Technology Trends: Transistor Count

22 23 Technology Trends

[Chart: performance of supercomputers, mainframes, minicomputers, and microprocessors, 1965-1995]

• Microprocessor performance exhibits astonishing progress! • The natural building blocks for parallel computers are also state-of-the-art microprocessors.

24 Architecture Trend: Role of Architecture

Clock rate increases 30% per year, while the overall CPU performance increases 50% to 100% per year

Where is the rest coming from? ➢Parallelism likely to contribute more to performance improvements

25 Architectural Trends

Greatest trend in VLSI is an increase in the exploited parallelism

• Up to 1985: bit level parallelism: 4-bit -> 8 bit -> 16-bit – slows after 32 bit – adoption of 64-bit • Mid 80s to mid 90s: Instruction Level Parallelism (ILP) – pipelining and simple instruction sets (RISC) – on-chip caches and functional units => superscalar execution – Greater sophistication: out of order execution, speculation • Nowadays: – Hyper-threading – Multi-core

26 Phase in VLSI Generation

[Chart: transistor counts per chip from 1,000 to 100,000,000, 1970-2005 (i4004, i8080, i8086, i80286, i80386, R2000, R10000, Pentium), spanning the bit-level, instruction-level, and thread-level parallelism eras]

27 ILP Ideal Potential

[Plots: distribution of the number of instructions issued per cycle (fraction of total cycles, 0 to 6+), and speedup vs. instructions issued per cycle, for an idealized superscalar machine]

– Limited parallelism inherent in one stream of instructions ➢Pentium Pro: 3 instructions ➢PowerPC 604: 4 instructions – Need to look across threads for more parallelism

28-33 TOP500 Supercomputer Sites [charts]

34 Technology Trend for Memory and Disk

• Divergence between memory capacity and speed more pronounced ➢ Capacity increased by 1000X from 1980-95, speed only 2X ➢ Larger memories are slower, while processors get faster → “memory wall” – Need to transfer more data in parallel – Need deeper cache hierarchies – Parallelism helps hide memory latency • Parallelism within memory systems too ➢ New designs fetch many bits within memory chip, followed with fast pipelined transfer across narrower interface

35 Technology Trends: Unbalanced system improvements

[Chart: latencies of SRAM cache, DRAM access, and disk seek measured in CPU cycles, 1980-2000; disk seek time grows from roughly 87,000 cycles in 1980 to roughly 5,000,000 cycles in 2000, while SRAM and DRAM access times remain from under one cycle to a few tens of cycles]
The disks in 2000 are more than 57 times “SLOWER” than their ancestors in 1980, measured in CPU cycles.

➔ Redundant Array of Inexpensive Disks (RAID)

36 Why Parallel Computing: Economics

▪ Commodity means cost-effectiveness ➢ Development cost ($5 – 100M) amortized over volumes of millions ➢ Building block offers significant cost-performance benefits

▪ Multiprocessors being pushed by software vendors (e.g. database) as well as hardware vendors

➢ Standardization by Intel makes small, bus-based SMPs commodity

➢ Multiprocessing on the desktop (laptop) is a reality

▪ Example: How do economics affect the platforms for scientific computing?

➢ Large-scale cluster systems replace vector supercomputers

➢ A supercomputer and a desktop share the same building block

37 Evolution of Architectural Models

• Historically (1970s - early 1990s), each parallel machine was unique, along with its programming model and language Architecture = prog. model + comm. abstraction + machine organization

• Throw away software & start over with each new kind of machine

➢ Dead Supercomputer Society: http://www.paralogos.com/DeadSuper/ • Nowadays we separate the programming model from the underlying parallel machine architecture. ➢ 3 or 4 dominant programming models • Dominant: shared address space, message passing, data parallel • Others: data flow, systolic arrays

38 Programming Model for Various Architectures

• Programming models specify communication and synchronization ➢ Multiprogramming: no communication/synchronization ➢ Shared address space: like a bulletin board ➢ Message passing: like phone calls ➢ Data parallel: more regimented, global actions on data • Communication abstraction: primitives for implementing the model ➢ Plays a role like the instruction set in a uniprocessor computer ➢ Supported by HW, by OS, or by user-level software • Programming models are the abstraction presented to programmers ➢ Write portably correct code that runs on many machines ➢ Writing fast code requires tuning for the architecture – Not always worth it – sometimes programmer time is more precious

39 Aspects of a parallel programming model • Control ➢How is parallelism created? ➢In what order should operations take place? ➢How are different threads of control synchronized? • Naming ➢What data is private vs. shared? ➢How is shared data accessed? • Operations ➢What operations are atomic? • Cost ➢How do we account for the cost of operations?

40 Programming Models: Shared Address Space

[Figure: virtual address spaces of a collection of processes communicating via shared addresses; a shared portion of each process's address space (accessed with ordinary loads and stores by P0 ... Pn) maps to common physical addresses in the machine's physical address space, while each process also keeps a private portion]

•Programming model ➢Process: virtual address space plus one or more threads of control ➢Portions of the address spaces of processes are shared ➢Writes to shared addresses are visible to all threads (in other processes as well) •Natural extension of the uniprocessor model: ➢ conventional memory operations for communication ➢ special atomic operations for synchronization
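As an illustration of "conventional memory operations for communication, special atomic operations for synchronization", here is a minimal sketch in C with POSIX threads. It is not from the slides; the shared counter and mutex are purely illustrative.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long shared_counter = 0;                     /* lives in the shared address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);                  /* atomic operation for synchronization */
        shared_counter++;                           /* ordinary load/store communicates the update */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("counter = %ld\n", shared_counter);      /* expect NTHREADS * 100000 */
    return 0;
}

Compile with -pthread; the threads communicate simply by storing to shared_counter, and the mutex supplies the synchronization.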

41 SAS Machine Architecture

• Motivation: Programming convenience ➢ Location transparency: • Any processor can directly reference any shared memory location • Communication occurs implicitly as result of loads and stores ➢ Extended from time-sharing on uni-processors – Processes can run on different processors – Improved throughput on multi-programmed workloads • Communication hardware also natural extension of uniprocessor ➢ Addition of processors similar to memory modules, I/O controllers

42 SAS Machine Architecture (Cont’d) One representative architecture: SMP: ➢Used to mean Symmetric MultiProcessor ➔All CPUs had equal capabilities in every area, e.g. in terms of I/O as well as memory access ➢Evolved to mean Shared Memory Processor ➔ non-message-passing machines (included crossbar as well as bus based systems) ➢Now it tends to refer to bus-based shared memory machines ➔Small scale: < 64 processors typically

[Diagram: processors P1 ... Pn connected through a network to a shared memory]

43 Example: Intel Pentium Pro Quad

[Block diagram: four P-Pro processor modules (each with a 256-KB L2 cache, interrupt controller, and bus interface) on a shared P-Pro bus (64-bit data, 36-bit address, 66 MHz), connected through PCI bridges to PCI I/O cards and through a memory controller/MIU to 1-, 2-, or 4-way interleaved DRAM]

• All coherence and multiprocessing glued in processor module • Highly integrated, targeted at high volume • Low latency and high bandwidth

44 Scaling Up: More SAS Machine Architectures

[Diagrams: "dance hall" organization (processors with caches on one side of the network, memory modules M on the other) vs. distributed shared memory (each processor-cache pair has its own local memory M, all connected by the network)]

• Dance-hall: ➢ Problem: interconnect cost (crossbar) or bandwidth (bus) ➢ Solution: scalable interconnection network ➔Bandwidth scalable ➢ latencies to memory uniform, but uniformly large (Uniform Memory Access (UMA)) ➢ Caching is key: coherence problem

45 Scaling Up: More SAS Machine Architectures


• Distributed shared memory (DSM) or non-uniform memory access (NUMA) ➢ Non-uniform access time to data in local memory vs. remote memory ➢ Caching of non-local data is key • Coherence cost

46 Example: SUN Enterprise

[Block diagram: CPU/memory cards (two processors, each with L1 and L2 caches, plus a memory controller and bus interface/switch) and I/O cards (bus interface to SBUS slots, 100bT Ethernet, SCSI, and FiberChannel) attached to the Gigaplane bus (256-bit data, 41-bit address, 83 MHz)]

• 16 cards of either type: processors + memory, or I/O • All memory accessed over bus, so symmetric • Higher bandwidth, higher latency bus

47 Example: Cray T3E

[Diagram: Cray T3E node with processor, cache, and memory behind a combined memory controller and network interface (NI), attached to a 3D torus switch (X, Y, Z links) and external I/O]

• Scale up to 1024 processors, 480MB/s links • Memory controller generates comm. request for nonlocal references • No hardware mechanism for coherence (SGI Origin etc. provide this)

48 Programming Models: Message Passing

[Diagram: process P executes "Send X, Q, t" on address X in its local address space; process Q executes "Receive Y, P, t" into address Y in its local address space; the send and receive are matched on the tag t]

• Programming model ➢ Directly access only the private address space (local memory); communicate via explicit messages • Send specifies data in a buffer to transmit to the receiving process • Recv specifies the sending process and a buffer to receive the data into ➢ In the simplest form, the send/recv match achieves pair-wise synchronization • Model is separated from basic hardware operations ➢ Library or OS support for copying, buffer management, protection ➢ Potentially high overhead: large messages needed to amortize the cost

49 Message Passing Architectures

• Complete processing node (computer) as building block, including I/O ➢ Communication via explicit I/O operations ➢ Processor/Memory/IO form a processing node that cannot directly access another processor’s memory. • Each “node” has a network interface (NI) for communication and synchronization.

[Diagram: processing nodes P1 ... Pn, each with its own memory and network interface (NI), connected by an interconnect]

50 DSM vs Message Passing

[Same node/interconnect diagram as above]

• High-level block diagrams are similar

• Both are programming paradigms that, in principle, can be supported on various parallel architectures;

• Implications of DSM and MP for architectures: ➢Fine-grained hardware support for DSM; • For MP, communication is integrated at the I/O level and need not be integrated into the memory system ➢MP can be implemented as middleware (a library) – see the sketch below; ➢MP has better scalability. • MP machines are easier to build than scalable shared address space machines
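For comparison with the shared-memory sketch earlier, here is a minimal message-passing sketch in C using MPI (MPI itself is an assumption; the slides describe only the abstract send/recv model). Rank 0 sends an array to rank 1; the explicit send and receive match on a tag.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double x[4] = {1.0, 2.0, 3.0, 4.0};    /* data in the sender's private memory */
    double y[4];                            /* receive buffer in the receiver's private memory */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Send X, Q, t: buffer, destination rank, tag */
        MPI_Send(x, 4, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive Y, P, t: buffer, source rank, tag -- matches the send above */
        MPI_Recv(y, 4, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %.1f ... %.1f\n", y[0], y[3]);
    }

    MPI_Finalize();
    return 0;
}

Run with at least two processes (e.g., mpirun -np 2 ./a.out). The matched send/recv pair is also the simplest form of the pair-wise synchronization described above.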

51 Example: IBM SP-2

[Block diagram: IBM SP-2 node with a Power 2 CPU, L2 cache, memory bus, memory controller, and 4-way interleaved DRAM; a network interface card (with an i860 processor, DMA, NI, and its own DRAM) sits on the MicroChannel I/O bus; nodes are connected by a general interconnect network formed from 8-port switches]

• Each node is essentially a complete RS6000 workstation; • Network interface integrated on the I/O bus (bandwidth limited by the I/O bus).

52 Example: Intel Paragon

[Block diagram: Intel Paragon node with two i860 processors (each with an L1 cache) on a 64-bit, 50 MHz memory bus, a memory controller, DMA, driver, NI, and 4-way interleaved DRAM; nodes attach through switches to a 2D grid network with 8-bit, 175 MHz bidirectional links. Photo: Sandia's Intel Paragon XP/S-based supercomputer]

53 Toward Architectural Convergence

• Convergence in hardware organizations ➢Tighter NI integration for MP ➢Hardware SAS machines pass messages at a lower level ➢Clusters of workstations/SMPs have become the most popular architecture for parallel systems • Programming models distinct, but organizations converging ➢Nodes connected by general networks and communication assists ➢Implementations also converging, at least in high-end machines

54 Programming Model: Data Parallel

➢ Operations performed in parallel on each element of data structure

➢ Logically single thread of control (sequential program)

➢ Conceptually, a processor associated with each data element

➢ Coordination is implicit – statements executed synchronously

➢ Example:

float x[100];
for (i = 0; i < 100; i++)      ➔      x = x + 1;
    x[i] = x[i] + 1;
(the element-wise loop on the left is expressed as the single data-parallel statement on the right)
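A rough shared-memory analogue of the same element-wise update, sketched in C with an OpenMP pragma (OpenMP is not covered on this slide; it is shown only to illustrate applying one operation to every element while keeping a single logical thread of control):

#include <stdio.h>

int main(void) {
    float x[100];
    for (int i = 0; i < 100; i++) x[i] = (float)i;

    /* Conceptually one logical thread of control; the pragma asks the
       compiler/runtime to apply the loop body to the elements in parallel. */
    #pragma omp parallel for
    for (int i = 0; i < 100; i++)
        x[i] = x[i] + 1.0f;

    printf("x[0] = %.1f, x[99] = %.1f\n", x[0], x[99]);
    return 0;
}

Compile with -fopenmp; without it the pragma is ignored and the loop simply runs sequentially.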

55 Programming Model: Data Parallel

• Architectural model: ➢ A control processor issues instructions ➢ An array of many simple, cheap processors (processing elements, PEs), each with a little memory ➢ An interconnect network that broadcasts data to the PEs, supports communication among PEs, and provides efficient synchronization • Motivation: ➢ Give up flexibility (different instructions in different processors) to allow a much larger number of processors ➢ Targeted at a limited scope of applications • Applications: ➢ Finite differences, linear algebra ➢ Document searching, graphics, image processing, etc.

56 A Case of DP: Vector Machine

[Figure: an example vector instruction adds vector registers vr1 and vr2 element-wise into vr3, logically performing # elts additions in parallel]

• Vector machine: ➢Multiple functional units ➢All performing the same operation ➢Instructions may specify very high parallelism (e.g., 64-way) but the hardware executes only a subset in parallel at a time • Historically important, but overtaken by MPPs in the 90s • Re-emerging in recent years ➢ At a large scale in the Earth Simulator (NEC SX6) and Cray X1 ➢ At a small scale in SIMD media extensions to microprocessors – SSE (Streaming SIMD Extensions), SSE2 (Intel: Pentium/IA64) – Altivec (IBM/Motorola/Apple: PowerPC) – VIS (Sun: Sparc) • Enabling technique ➢Compiler does some of the difficult work of finding parallelism
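To make the SIMD-extension case concrete, a small sketch in C using SSE intrinsics (illustrative only; assumes an x86 compiler with SSE support). One _mm_add_ps instruction performs four single-precision additions at once:

#include <stdio.h>
#include <xmmintrin.h>                        /* SSE intrinsics */

int main(void) {
    float x[100];
    for (int i = 0; i < 100; i++) x[i] = (float)i;

    __m128 ones = _mm_set1_ps(1.0f);          /* the vector (1, 1, 1, 1) */
    for (int i = 0; i < 100; i += 4) {        /* 100 is a multiple of 4 */
        __m128 v = _mm_loadu_ps(&x[i]);       /* load 4 floats */
        v = _mm_add_ps(v, ones);              /* 4 additions in one instruction */
        _mm_storeu_ps(&x[i], v);              /* store 4 floats */
    }

    printf("x[0] = %.1f, x[99] = %.1f\n", x[0], x[99]);
    return 0;
}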

57 Flynn's Taxonomy

A classification of computer architectures based on the number of streams of instructions and data:

• Single instruction/single data stream (SISD) - a sequential computer. • Multiple instruction/single data stream (MISD) - unusual. • Single instruction/multiple data streams (SIMD) - e.g., a vector processor. • Multiple instruction/multiple data streams (MIMD) - multiple autonomous processors simultaneously executing different instructions on different data. ➔ The programming model converges on SPMD (single program, multiple data)

58 Clusters have Arrived

59 What’s a Cluster?

• Collection of independent computer systems working together as if they were a single system. • Coupled through a scalable, high-bandwidth, low-latency interconnect.

60 Clusters of SMPs SMPs are the fastest commodity machines, so use them as building blocks for a larger machine with a network Common names: • CLUMP = Cluster of SMPs What is the right programming model? • Treat the machine as “flat” and always use message passing, even within an SMP (simple, but ignores an important part of the memory hierarchy). • Shared memory within one SMP, but message passing outside of an SMP.
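A hedged sketch of the second option (shared memory inside an SMP, message passing between SMPs) in C, combining MPI with OpenMP. The loop bounds and the reduction are illustrative, not from the slides.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);                       /* message passing between SMP nodes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Divide the iteration space across nodes (assume it divides evenly). */
    const int N = 1000000;
    int chunk = N / nprocs;
    int start = rank * chunk, end = start + chunk;

    double local_sum = 0.0;
    /* Shared-memory parallelism among the cores of one SMP node. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = start; i < end; i++)
        local_sum += 1.0 / (1.0 + i);

    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum across %d nodes = %f\n", nprocs, global_sum);

    MPI_Finalize();
    return 0;
}

Compile with an MPI wrapper plus OpenMP support (e.g., mpicc -fopenmp) and launch one MPI process per SMP node.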

61 Convergence: Generic Parallel Architecture • A generic modern multiprocessor

[Diagram: a generic multiprocessor node consisting of processor(s) P, cache $, memory Mem, and a communication assist (CA), attached to a scalable network]

• Node: Processor(s), memory, plus communication assist (CA) ➢ Network interface and communication controller • Scalable network ➔ Convergence allows lots of innovation, now within the same framework

➢ integration of assist within node, what operation, how efficiently …

62 Lecture Summary

▪ Parallel computing: a parallel computer is a collection of processing elements that can communicate and cooperate to solve large problems fast • Parallel computing has become central and mainstream ➢ Application demands ➢ Technology and architecture trends ➢ Economics ▪ Convergence in parallel architecture ➢ initially: close coupling of programming model and architecture • Shared address space, message passing, data parallel ➢ now: separation and identification of dominant models/architectures • Programming models: shared address space, message passing, and data parallel • Architectures: small-scale shared memory, large-scale distributed memory, large-scale SMP clusters.

63