Introduction: problems, models, performance

Jesper Larsson Träff [email protected] TU Wien Parallel Computing

Practical Parallel Computing

Parallelism everywhere! How to use all these resources? Limits?

June 2016: 10,649,600 cores, 93 PFLOPS (125 PFLOPS peak)

Mobile phones: „dual core“, „quad core“ (2012?), …

…octa-core (2016, Samsung Galaxy 7)

“Never mind that there's little-to-no software that can take advantage of four processing cores, Xtreme Notebooks has released the first quad-core laptop in the U.S.” (2007)

June 2012: IBM BlueGene/Q, 1572864 cores, 16PF

June 2011: Fujitsu K, 705024 cores, 11PF

As of ca. 2010: Why? There are (almost) no purely sequential computer systems anymore: multi-core, GPU/accelerator enhanced, …



Challenge: How do I speed up windows…?

As of 2010: either a) all applications must be parallelized, or b) there must be enough independent applications that can run at the same time


“I spoke to five experts in the course of preparing this article, and they all emphasized the need for the developers who actually program the apps and games to code with multithreaded execution in mind”

7 myths about quad-core phones (Smartphones Unlocked) by Jessica Dolcourt April 8, 2012 12:00 PM PDT …many similar quotations can (still) be found

The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software, by Herb Sutter. “The biggest sea change in software development since the OO revolution is knocking at the door, and its name is concurrency.” Dr. Dobb's Journal, 30(3), March 2005


“Free lunch” (as in “There is no such thing as a free lunch”):

Exponential increase in single-core performance (18-24 month doubling rate, Moore’s “Law”), no software changes needed to exploit the faster processors

Kunle Olukotun (Stanford), ca. 2010: the “free lunch” was over… ca. 2005

•Clock speed limit around 2003 •Power consumption limit from 2000 •Instructions/clock limit in the late 90s

But: the number of transistors per chip can continue to increase (>1 billion)

Solution(?): Put more cores on chip „multi-core revolution“



Parallelism challenge: Solve problem p times faster on p (slower, more energy efficient) cores

[Figure credits: Kunle Olukotun, 2010; Karl Rupp (TU Wien), 2015]

Single-core performance does still increase, somewhat, but much slower… (Henk Poley, 2014, www.preshing.com)

Average, normalized, SPEC benchmark numbers, see www.spec.org

From Hennessy/Patterson, Computer Architecture

2006: factor 10³-10⁴ increase in integer performance over a 1978 high-performance processor

Single-core processor development (the “free lunch”) made life very difficult for parallel computing in the 90s.

Architectural ideas driving the sequential performance increase

Increase in clock frequency (“technological advances”, factor 20-200?) alone does not explain the performance increase:

•Deep pipelining
•Superscalar execution (>1 instruction/clock) through multiple functional units…
•… through SIMD units (old HPC idea: vector processing): same instruction on multiple data per clock
•Out-of-order execution, speculative execution
•Branch prediction
•Caches (pre-fetching)
•Simplified/better instruction sets (better for the compiler)
•SMT/Hyperthreading

Mostly fully transparent; at most the compiler needs to care: the “free lunch”

•Very diverse parallel architectures
•No single, commonly agreed upon abstract model (for designing and analyzing algorithms)
•Has been so
•Is still so (largely)

Many different programming paradigms (models), different programming frameworks (interfaces, languages)

Theoretical Parallel Computing

Parallelism was always there! What can be done? Limits?

Assume p processors instead of just one, reasonably connected (memory, network, …)

•How much faster can some given problem be solved? Can some problems be solved better?
•How? New algorithms? New techniques?
•Can all problems be solved faster? Are there problems that cannot be solved faster with more processors?
•Which assumptions are reasonable?
•Does parallelism give new insights into the nature of computation?

Sequential computing vs. Parallel computing

[Diagram — sequential computing: algorithm in a model (RAM) → concrete program (C, C++, Java, Haskell, Fortran,…) → concrete architecture. Parallel computing: algorithm in model A, B, …, Z → concrete program, different paradigms (MPI, OpenMP, Cilk, OpenCL, …) → concrete architectures A … Z.]

…is difficult. Analysis in the model (often) has some relation to the concrete execution.

Parallel computing

[Diagram: algorithm in model A, B, …, Z → concrete program, different paradigms (MPI, OpenMP, Cilk, OpenCL, …) → concrete architectures A … Z.]

Huge gap: no „standard“ abstract model (e.g., RAM). …is extremely difficult. Analysis in a model may have little relation to the concrete execution.


Challenges:

•Algorithmic: Not all problems seem to be easily parallelizable
•Portability: Support for the same language/interface on different architectures (e.g. MPI in HPC)
•Performance portability (?)

Elements of parallel computing:
•Algorithmic: Find the parallelism
•Linguistic: Express the parallelism
•Practical: Validate the parallelism (correctness, performance)
•Technical challenge: run-time/compiler support for the paradigm

Parallel computing: Accomplish something with a coordinated set of processors under the control of a (parallel) program

Why study parallel computing?

•It is inevitable: multi-core revolution, GPGPU paradigm, …

•It‘s interesting, challenging, highly non-trivial – full of surprises

•Key discipline of computer science (von Neumann; golden theory decade: 1980 to early 90s)
•It‘s ubiquitous (gates, architecture: pipelines, ILP, TLP, systems: operating systems, software), not always transparent
•It‘s useful: large, extremely computationally intensive problems, Scientific Computing, HPC
•…

Parallel computing: The discipline of efficiently utilizing dedicated parallel resources (processors, memories, …) to solve a single, given computational problem.

Specifically: Parallel resources with significant inter-communication capabilities, for problems with non-trivial communication and computational demands

Focus on properties of solution (time, size, energy, …) to given, individual problem

Buzz words: tightly coupled, dedicated parallel system; multi-core processor, GPGPU, High-Performance Computing (HPC), …

Distributed computing: The discipline of making independent, non-dedicated resources available and cooperative toward solving specified problem complexes.

Typical concerns: correctness, availability, progress, security, integrity, privacy, robustness, fault tolerance, …

Buzz words: internet, grid, cloud, agents, autonomous computing, mobile computing, …

Concurrent computing: The discipline of managing and reasoning about interacting processes that may (or may not) progress simultaneously

Typical concerns: correctness (often formal), e.g. deadlock-freedom, starvation-freedom, mutual exclusion, fairness

Buzz words: operating systems, synchronization, interprocess communication, locks, semaphores, autonomous computing, calculi, CSP, CCS, pi-calculus, …

Parallel vs. Concurrent computing (adopted from Madan Musuvathi)

Given problem: specification, algorithm, data

[Diagram: many processes, via their processors, accessing a shared resource: memory (locks, semaphores, data structures), a device, …]

Concurrent computing: Focus on coordination of access to/usage of shared resources (to solve given, computational problem)

[Diagram: a problem (specification, algorithm, data) divided into subproblems, one per processor; coordination: synchronization, communication]

Parallel computing: Focus on dividing given problem (specification, algorithm, data) into subproblems that can be solved by dedicated processors (in coordination)

The “problem” of parallelization

How to divide given problem into subproblems that can be solved in parallel?

Problem: •Specification •Algorithm? •Data?

[Diagram: the problem divided into subproblems]

•How is the computation divided? Coordination necessary? Does the sequential algorithm help/suffice? •Where are the data? Is communication necessary?

Aspects of parallelization

•Algorithmic: How to divide computation into independent parts that can be executed in parallel? What kinds of shared resources are necessary? Which kinds of coordination? How can overheads be minimized (redundancy, coordination, synchronization)?

•Scheduling/Mapping: How are independent parts of the computation assigned to processors?

•Load balancing: How can independent parts be assigned to processors such that all resources are utilized efficiently?

•Communication: When must processors communicate? How? •Synchronization: When must processors agree/wait?

Linguistic: How are the algorithmic aspects expressed? Concepts (programming model) and concrete expression (programming language, interface, library)

Pragmatic/practical: How does the actual, parallel machine look? What is a reasonable, abstract model?

Architectural: which kinds of parallel machines can be realized? How do they look?

Levels of parallelization

Architecture: gates, logical and functional units — not covered here

Computational: Instruction Level Parallelism (ILP), SIMD

Functional, explicit: threads (possibly concurrent, sharing architecture resources), cores, processors

“Parallel computing”

Large-scale: coarse-grained tasks, multi-level, coupled application parallelism — not covered here

Automatic parallelization(?)

Can’t we just leave it all to the compiler/hardware?

•“high-level” problem specification •Sequential program

Efficient code that can be executed on parallel, multi-core processor

Successful only to a limited extent:
•Compilers cannot invent a different algorithm
•Hardware parallelism is not likely to go much further

Samuel P. Midkiff: Automatic Parallelization: An Overview of Fundamental Compiler Techniques. Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, 2012

Explicit, parallel programming (today):

•Explicitly parallel code in some parallel language •Support from parallel libraries •(Domain specific languages)

•Compiler does as much as compiler can do

Lots of interesting problems and tradeoffs; an active area of research

Some „given, individual problems“ for this lecture

Problem 1:

Matrix-vector multiplication: given an n×m matrix A and an m-element (column) vector v, compute the n-element (column) vector u = Av, with u[i] = ∑_j A[i,j]·v[j].

Dimensions n, m large: obviously some parallelism. How to compute the ∑ in parallel? Access to the vector?
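As a concrete sequential baseline for Problem 1 (a sketch only; the function name and the row-major layout are choices made here, not from the slides), with Tseq(n,m) = O(nm):

#include <stddef.h>

/* u = A*v for an n x m matrix A stored row-major, A[i,j] = a[i*m+j] */
void matvec(size_t n, size_t m, const double *a, const double *v, double *u)
{
  for (size_t i = 0; i < n; i++) {
    double sum = 0.0;                     /* u[i] = sum_j A[i,j]*v[j] */
    for (size_t j = 0; j < m; j++)
      sum += a[i*m + j] * v[j];
    u[i] = sum;
  }
}

The n rows are independent (the obvious parallelism); each inner ∑ is a reduction, which is exactly where the question “how to ∑ in parallel?” comes in.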

Problem 1a:

Matrix-matrix multiplication: given an n×k matrix A and a k×m matrix B, compute the n×m matrix product C = AB, with C[i,j] = ∑_l A[i,l]·B[l,j] (summing over the inner dimension).

Problem 1b:

Solving sets of linear equations. Given matrix A and vector b, find x such that Ax = b

Preprocess A such that the solution to Ax = b can easily be found for any b (LU factorization, …)

Problem 2:

Use: discretization and solution of certain partial differential equations (PDEs); image processing; …

Stencil computation: given an n×m matrix A, update it as follows and iterate until some convergence criterion is fulfilled (5-point stencil in 2d, with suitable handling of the matrix border):

iterate {
  for all (i,j) {
    A[i,j] <- (A[i-1,j]+A[i+1,j]+A[i,j-1]+A[i,j+1])/4
  }
} until (convergence)

Looks well-behaved, „embarrassingly parallel“? Data distribution? Conflicts on updates?
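A runnable sequential sketch of this iteration, kept as a baseline; the assumptions here (not from the slides) are a Jacobi-style sweep into a second array, which sidesteps the update conflicts asked about above, fixed boundary values, and a simple max-change convergence test:

#include <math.h>
#include <stddef.h>

/* One Jacobi sweep over the interior of an n x m grid; returns the max change. */
static double sweep(size_t n, size_t m, const double *a, double *b)
{
  double maxdiff = 0.0;
  for (size_t i = 1; i + 1 < n; i++)
    for (size_t j = 1; j + 1 < m; j++) {
      double v = (a[(i-1)*m+j] + a[(i+1)*m+j] + a[i*m+(j-1)] + a[i*m+(j+1)]) / 4.0;
      double d = fabs(v - a[i*m+j]);
      if (d > maxdiff) maxdiff = d;
      b[i*m+j] = v;
    }
  return maxdiff;
}

void stencil(size_t n, size_t m, double *a, double *tmp, double eps)
{
  /* copy the boundary once so both grids share the same fixed border */
  for (size_t j = 0; j < m; j++) { tmp[j] = a[j]; tmp[(n-1)*m+j] = a[(n-1)*m+j]; }
  for (size_t i = 0; i < n; i++) { tmp[i*m] = a[i*m]; tmp[i*m+m-1] = a[i*m+m-1]; }
  double *cur = a, *next = tmp;
  while (sweep(n, m, cur, next) > eps) {   /* iterate until convergence */
    double *t = cur; cur = next; next = t; /* swap the two grids */
  }
  /* on exit the latest values are in next (≈ cur by the convergence test) */
}

All (i,j) updates within one sweep are independent, which is why the computation looks „embarrassingly parallel“; the open questions are how to distribute the grid and how to handle the values at the partition borders.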

WS16/17 ©Jesper Larsson Träff A: 1 2 3 4 5 6 7 … All prefix-sums B: x 1 3 6 10 15 21 28 …

Problem 3:

Merging two sorted arrays of size n and m into a sorted array of size n+m. Easy to do sequentially, but the sequential algorithm looks… sequential.

Problem 4:

Computation of all prefix sums: given an array A of size n with elements of some type S with an associative operation +, compute for all indices 0≤i<n the prefix sum ∑_{0≤j≤i} A[j].

Implies a solution to the problem of computing B[i] = ∑_{0≤j<i} A[j] (as in the example above).
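A sequential baseline for Problem 4 with Tseq(n) = O(n), taking + to be integer addition for concreteness (an inclusive in-place version, plus the shifted B-variant from the example above):

#include <stddef.h>

/* Inclusive prefix sums in place: A[i] <- A[0]+...+A[i] */
void prefix_sums(long *a, size_t n)
{
  for (size_t i = 1; i < n; i++) a[i] += a[i-1];
}

/* Exclusive variant: B[i] = A[0]+...+A[i-1]; B[0] set to 0 here */
void exclusive_prefix_sums(const long *a, long *b, size_t n)
{
  long sum = 0;
  for (size_t i = 0; i < n; i++) { b[i] = sum; sum += a[i]; }
}

The loop-carried dependence a[i] += a[i-1] is what makes the sequential algorithm “look sequential”; a parallel algorithm needs a different idea (e.g., a tree-structured scan).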

Problem 5:

Sorting a sequence of objects („reals“, integers, objects with an order relation) stored in an array. Hopefully the parallel merge solution can be of help. Other approaches? Quicksort? Integers (counting, bucket, radix sort)?

Parallel computing as a (theoretical) CS discipline

(Traditional) objective: Solve a given computational problem faster

•Than what?

•What is a reasonable „model“ for parallel computing? •How fast can a given problem be solved? How many resources can be productively exploited?

•Are there problems that cannot be solved in parallel? Fast? At all? •…

Architecture model: Abstraction of the important modules of a computational system (processors, memories), their interconnection and interaction.

Computational/execution model: (formal) framework for the design and analysis of algorithms for the computational system; cost model.

Example: RAM (Random-Access Machine): a processor P (ALU, PC, registers) capable of executing instructions stored in memory M, on data in memory.

Execution of an instruction, access to memory: first assumption, unit cost. Realistic? Useful?

Traditional RAM cost model: the same unit cost for all operations:
•Memory: load, store
•Arithmetic (integer, real numbers)
•Logic: and, or, xor, shift
•Branch, compare, procedure call

The RAM is also known as the von Neumann architecture or stored-program computer.

John von Neumann (1903-57), Report on EDVAC, 1945; also Eckert & Mauchly, ENIAC

„von Neumann bottleneck“: Program and data are separate from the CPU; the processing rate is limited by the memory rate.

John W. Backus: Can Programming Be Liberated From the von Neumann Style? A Functional Style and its Algebra of Programs. Commun. ACM 21(8): 613-641 (1978); Turing Award Lecture, 1977

Example: RAM with cache/memory hierarchy: a different (NOT unit) cost model, some (cache) memory accesses are cheaper than others.

Vector computer: increased memory rate; the ALU operates on vectors instead of scalars. [Diagram: one processor P, several memory banks M]

Shared-memory model, bus based. [Diagram: processors P P P P connected to a single memory M via a bus]

Shared-memory model (network based, emulated). [Diagram: processors P P P P connected to memory M through a network]

Synchronous, shared-memory model: processors operate in lock-step, time = instruction time: the Parallel RAM (PRAM). Realistic? Useful?

The PRAM was the main theoretical model from the late 70s throughout the 80s; interest was lost ca. 1993. It remains a very important analysis/idea tool.

Shared-memory model, banked, network based, not synchronous. [Diagram: processors P P P P, memory banks M M M … M, connected by a network]

UMA (Uniform Memory Access): Access time to a memory location is independent of the location and of the accessing processor, e.g., O(1), O(log M), …

NUMA (Non-Uniform Memory Access): Access time depends on processor and location. Locality: some locations can be accessed faster by a processor than others.

Architecture model defines resources, describes
•Composition of processor, functional units: ALU, FPU, registers, w-bit words vs. unlimited, vector unit (MMX, SSE, AVX)
•Types of instructions
•Memory system, caches
•…

Execution model/cost model specifies •How instructions are executed •(relative) Cost of instructions, memory accesses •…

Level of detail/formality dependent on purpose: What is to be studied (complexity theory, algorithms design, …), what counts (instructions, memory accesses, …)

Distributed memory model: processor-memory pairs (P, M) connected by a communication network.

Parallel architecture model defines

•Synchronization between processors •Synchronization operations •Atomic operations, shared resources (memory, registers) •Communication mechanisms: network topology, properties •Memory: shared, distributed, hierarchical, … •…

Cost model defines •Cost of synchronization, atomic operations •Cost of communication (latency, bandwidth, …) •…

A different parallel architecture model: cellular automaton, systolic array, …: simple processors without memory (finite state automata, FSA) operate in lock step on a (potentially infinite) grid, with local communication only

The state of cell (i,j) in the next step is determined by •its own state •the states of the neighbors in some neighborhood, e.g., (i,j-1), (i+1,j), (i,j+1), (i-1,j)

[John von Neumann, Arthur W. Burks: Theory of Self-Reproducing Automata, 1966] [H. T. Kung: Why systolic architectures? IEEE Computer 15(1): 37-46, 1982]

Flynn‘s taxonomy: Orthogonal classification of (parallel) architectures/models

Classification by instruction stream × data stream:
•SISD: Single Instruction, Single Data stream
•SIMD: Single Instruction, Multiple Data streams
•MISD: Multiple Instruction, Single Data stream
•MIMD: Multiple Instruction, Multiple Data streams

[M. J. Flynn: Some computer organizations and their effectiveness. IEEE Trans. Comp. C-21(9):948-960, 1972]

SISD: Single processor, single stream of instructions, operates on a single stream of data. Sequential architecture (e.g. RAM)

SIMD: Single processor, single stream of operations, operates on multiple data per instruction. Example: traditional vector computer, PRAM (some variants)

MISD: Multiple instructions operate on a single data stream. Example: pipelined architectures, streaming architectures(?), systolic arrays (a 70s architectural idea). Some say: empty

MIMD: Multiple instruction streams, multiple data streams

Typical instances

[Diagrams of typical instances: SISD (one P, one M), SIMD (one P, several M), MIMD (several P with memories M and a communication network)]

Programming model: Abstraction, close to the programming language, defining parallel resources, management of parallel resources, parallelization paradigms, memory structure, memory model, synchronization and communication features, and their semantics

Parallel programming language, or library („interface“) is the concrete implementation of one (or more: multi-modal, hybrid) parallel programming model(s)

Cost of operations: May be specified with programming model; often by architecture/computational model

Execution model: when and how parallelism in programming model is effected

Parallel programming model defines, e.g.,

•Parallel resources, entities, units: processes, threads, tasks, … •Expression of parallelism: explicit or implicit •Level and granularity of parallelism

•Memory model: shared, distributed, hybrid •Memory semantics („when operations take effect/become visible“) •Data structures, data distributions

•Methods of synchronization (implicit/explicit) •Methods and modes of communication

Examples:

1. Threads, shared memory, block-distributed arrays, fork-join parallelism
2. Processes, explicit message passing, collective communication, one-sided communication („RDMA“), PGAS(**)
3. Data parallel SIMD, SPMD(*)
4. …

Concrete libraries/languages: pthreads, OpenMP, MPI, UPC, TBB, …

(*)SPMD: Single Program, Multiple Data (**)PGAS: Partitioned Global Address Space – not in this lecture

Same Program Multiple Data (SPMD)

Restricted MIMD model: All processors execute same program

•May do so asynchronously: different processors may be in different parts of the program at any given time
•The same objects (procedures, variables) exist for all processors; “remote procedure call”, “active messages”, “remote-memory access” make sense

Programming model concepts (not this lecture): active messages, remote procedure call, … (MPI: RMA/one-sided communication)

[F. Darema et al.: A single-program-multiple-data computational model for EPEX/FORTRAN, 1988]

[Layers: programming language/library/interface/paradigm (OpenMP, MPI, Cilk, …) — programming model — algorithmic support, „run-time“ — architecture model — „real“ hardware]

•Different architecture models can realize a given programming model; a closer fit allows more efficient use of the architecture
•Challenge: a programming model that is useful and close to „realistic“ architecture models, to enable realistic analysis/prediction
•Challenge: a language that conveniently realizes the programming model

Examples: OpenMP programming interface/language for the shared-memory model, intended for shared-memory architectures; „data parallel“

Can be implemented with DSM (Distributed Shared Memory) on distributed memory architectures – but performance has usually not been good. Requires DSM implementation/algorithms

MPI interface/library for distributed memory model, can be used on shared-memory architectures, too. Needs algorithmic support (e.g., „collective operations“)

Cilk language (extended C) for the shared-memory model, for shared-memory architectures; „task parallel“. Needs run-time support (e.g., „work-stealing“)

Performance: Basic observations and goals

Model: p dedicated parallel processors collaborate to solve given problem of input size n. We will be interested in worst-case complexities (execution time)

•Processors can work independently (local memory, program)

•Communication and coordination incurs some overhead

Challenge and one main goal of parallel processing

… not to make life easy

Be faster than commonly used (good, best known, best possible…) sequential algorithm and implementation, utilizing the p processors efficiently

Main goal: Speeding up computations by parallel processing

Tseq(n): time for 1 processor to solve problem of size n

Tpar(p,n): time for p processors to solve problem of size n

Sp(n) = Tseq(n)/Tpar(p,n)

Speedup measures the gain in moving from sequential to parallel computation (note: parameters p and n)

Goal: Achieve as large speed-up as possible

What is „time“ (number of instructions)?

What exactly is Tseq(n), Tpar(p,n)?

-Time for some algorithm for solving problem? -Time for a specific algorithm for solving problem?

-Time for best known algorithm for problem? -Time for best possible algorithm for problem?

-Time for specific input of size n, average case, worst case, …? -Asymptotic time, large n, large p?

-Do constants matter, e.g. O(f(p,n)) or 25n/p+3ln (4 (p/n))… ?

Choose a sequential algorithm (theory), choose an implementation of this algorithm (practice)

Tseq(n):

•Theory: Number of instructions (or other critical cost measure) to be executed in the worst case for inputs of size n. •The number of instructions carried out is often termed WORK

•Practice: Measured time (or other parameter) of execution over some inputs (experiment design)

Theory and practice: Always state baseline sequential algorithm/implementation

Examples (theory):

Tseq(n) = O(n): finding the maximum of n numbers in an unsorted array; prefix sums
Tseq(n,m) = O(n+m): merging of two sequences; BFS/DFS in a graph
Tseq(n) = O(n log n): comparison-based sorting
Tseq(n,m) = O(n log n + m): Single-Source Shortest Path (SSSP)
Tseq(n) = O(n³): matrix multiplication, input two n×n matrices; can be solved in o(n³) (Strassen etc.)

Standard, worst-case, asymptotic complexities

Cormen, Leiserson, Rivest, Stein: Introduction to Algorithms. 3rd ed., MIT Press, 2009

Practice:
•Construct good input examples to measure the running time Tseq(n); experimental methodology
•Worst case not always possible, not always interesting; best case?
•Experimental methods to get a stable, accurate Tseq(n): repeat measurements many times (rule of thumb: average over at least 30 repetitions)

New issue with modern processors: Is time always the same thing? Clock frequency may not be constant, e.g., can depend on load of system, energy cap, “turbomode”, etc.. Such factors difficult to control

Experimental science: Always some assumptions about repeatability, regularity, determinism, …

Parallel performance in theory

Definition: Let Tseq(n) be the (worst-case) time of the best possible/best known sequential algorithm, and Tpar(p,n) the (worst-case) time of the parallel algorithm. The absolute speed-up of Tpar(p,n) with p processors over Tseq(n) is

Sp(n) = Tseq(n)/Tpar(p,n)

Observation (proof follows): Best-possible, absolute speed-up is linear in p

Goal: Obtain (linear) speed-up for as large p as possible (as a function of the problem size n), for as many n as possible

WS16/17 ©Jesper Larsson Träff Definition: T∞(n): the smallest possible running time of parallel algorithm given arbitrarily many processors. Per definition T∞(n) ≤ Tpar(p,n) for all p. Speedup is limited by

Sp(n) = Tseq(n)/Tpar(p,n) ≤ Tseq(n)/T∞(n)

Definition: Tseq(n)/T∞(n) is called the parallelism of the algorithm.

The parallelism is the largest number of processors that can be employed and still give linear speedup (i.e., for p up to about Tseq(n)/T∞(n)).
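A small worked illustration (an example added here, not from the slides): a work-optimal parallel prefix-sums algorithm can reach $T_\infty(n) = O(\log n)$ with $T_{\mathrm{seq}}(n) = O(n)$, so

\[
S_p(n) \;\le\; \frac{T_{\mathrm{seq}}(n)}{T_\infty(n)} \;=\; O\!\left(\frac{n}{\log n}\right),
\]

i.e., its parallelism is about $n/\log n$: up to that many processors can be employed with linear speedup, and more processors cannot help.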

For speedup (and other complexity measures), distinguish:

•Problem G to be solved (mathematical specification)

•Some algorithm A to solve G •Best possible (lower bound) algorithm A* for G, best known algorithm A+ for G: The complexity of G

•Implementation of A on some machine M

Example: “data parallel” (SIMD) computation

Algorithm/program: for (i=0; i<n; i++) x[i] = a[i]+b[i];

Problem: sum of two n-element vectors

[Diagram: the problem split into p subproblems, each the sum of n/p elements]

Best possible parallelization: sequential work divided evenly across the p processors: Tpar(p,n) = Tseq(n)/p, Speedup(p) = p

Tpar(p,n) = c·(n/p) for a constant c≥1:
•Perfect speedup
•“Embarrassingly parallel”
•“Pleasantly parallel”
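A hedged sketch of this data-parallel computation in C with OpenMP (one of the interfaces used later in the lecture); the static schedule gives each of the p threads a contiguous block of about n/p iterations:

/* compile with e.g. -fopenmp */
#include <stddef.h>

/* x = a + b element-wise; the iterations are independent ("embarrassingly parallel") */
void vector_sum(size_t n, const double *a, const double *b, double *x)
{
  #pragma omp parallel for schedule(static)
  for (long i = 0; i < (long)n; i++)
    x[i] = a[i] + b[i];
}

Ideally Tpar(p,n) ≈ Tseq(n)/p here; in practice such a memory-bound loop is limited by memory bandwidth well before all cores are saturated.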

Tseq(n): the work, measured in instructions and/or time, that has to be carried out for a problem of size n (from start to stop).

Perfect parallelization: sequential work evenly divided between the p processors, no overhead, so Tpar(p,n) = Tseq(n)/p

Perfect speedup

Sp(n) = Tseq(n)/(Tseq(n)/p) = p

… and very rare in practice

Sequential work may be unevenly divided between the p processors: load imbalance, Tpar(p,n) > Tseq(n)/p, even though ∑Ti(n) = Tseq(n)

Define Tpar(p,n) = max Ti(n) over all processors

Tpar is the time for the slowest processor to complete; all processors are assumed to start at the same time

Wpar(n) = ∑Ti(n) is called the work of the parallel algorithm = total number of instructions performed by the p processors

The product C(n) = p*Tpar(p,n) is called the cost of the parallel algorithm: total time in which the p processors are reserved (and has to be paid for)

Area C(n) = p*Tpar(p,n)


“Theorem:” Perfect speedup Sp(n) = p is best possible and cannot be exceeded

“Proof”: Assume Sp(n) > p for some n. Tseq(n)/Tpar(p,n) > p implies Tseq(n) > p*Tpar(p,n). A better sequential algorithm could be constructed by simulating the parallel algorithm on a single processor. The instructions of the p processors are carried out in some correct order, one after another on the sequential processor. This contradicts that Tseq(n) was best possible/known time.

Reminder: Speedup is calculated (measured) relative to “best” sequential implementation/algorithm

By assumption C(n) = p*Tpar(p,n) < Tseq(n)

Simulation A: one step of P1, one step of P2, …, one step of P(p-1), one step of P1, …, for C(n) iterations

Simulation B: steps of P1 until communication/synchronization, steps of P2 until communication/synchronization, …

Both simulations yield a new, sequential algorithm with time Tsim(n) ≤ C(n) < Tseq(n).

This contradicts that Tseq(n) was the time of the best possible/best known sequential algorithm.

The construction shows that the total parallel work must be at least as large as the sequential work Tseq; otherwise, a better sequential algorithm could be constructed.

Crucial assumptions: Sequential simulation possible (enough memory to hold problem and state of parallel processors), sequential memory behaves as parallel memory, … NOT TRUE for real systems and real problems

Lesson: Parallelism offers only „modest potential“, speed-up cannot be more than p on p processors

Lawrence Snyder: Type architecture, shared memory and the corollary of modest potential. Annual Review of Computer Science, 1986

Aside: Such simulations are actually sometimes done, and can be very useful to understand (model) and debug parallel algorithms.

Some such simulation tools:
•SimGrid (INRIA)
•LogGOPSim (Hoefler et al.)
•…

Possible bachelor thesis subject

The product C(n) = p*Tpar(p,n) is called the cost of the parallel algorithm: the total time in which the p processors are occupied

Definition: Parallel algorithm is called cost-optimal if C(n) = O(Tseq(n)). A cost-optimal algorithm has linear (perhaps perfect) speedup

Wpar(n) = ∑Ti(n) is called the parallel work of the parallel algorithm = total number of instructions performed by some number of processors

Definition: Parallel algorithm is called work-optimal if Wpar(n) = O(Tseq(n)). A work-optimal algorithm has potential for linear speedup (for some number of processors)

Proof (linear speed-up of a cost-optimal algorithm): Given a cost-optimal parallel algorithm with p*Tpar(p,n) = c*Tseq(n) = O(Tseq(n))

This implies Tpar(p,n) = c*Tseq(n)/p, so

Sp(n) = Tseq(n)/Tpar(p,n) = p/c

The constant factor c captures the overheads and load imbalance (see later) of the parallel algorithm relative to best sequential algorithm. The smaller c, the closer the speedup to perfect

Given a work-optimal parallel algorithm, ∑Ti(n) = Tseq(n), with Tpar(n) = max Ti(n):

execute it on a smaller number of processors, such that ∑Ti(n) = p*Tpar(p,n) = O(Tseq(n)).

Proof idea (a work-optimal algorithm can have linear speed-up):

1. Work-optimal algorithm
2. Schedule the work items Ti(n) on p processors, such that p*Tpar(p,n) = O(Tseq(n))
3. With this number of processors, the algorithm is cost-optimal
4. Cost-optimal algorithms have linear speed-up

The scheduling in step 2 is possible in principle, but may not be trivial

Parallel algorithms’ design goal: Work-optimal parallel algorithm with as small Tpar(n) as possible (and therefore large parallelism: many processors can be utilized)

Example: Non work-optimal algorithm

DumbSort with T(n) = O(n2) that can be perfectly parallelized, Tpar(p,n) = O(n2/p)

Well-known that Tseq(n) = O(n log n), with many algorithms and good implementations

Sp(n) = (n log n)/(n²/p) = p·(log n)/n: (small) linear speedup for fixed n (but not independent of n)

Non work-optimal algorithm: Speed-up decreases with n

Break-even: When is parallel algorithm faster than sequential?

Tpar(p,n) < Tseq(n) ⟺ n²/p < n log n ⟺ n/p < log n ⟺ p > n/log n
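One concrete instance of such a “DumbSort” (a choice made here; the slides do not name one) is rank/enumeration sort, T(n) = O(n²), whose outer iterations are completely independent; ties are broken by index so that equal keys get distinct positions:

#include <stddef.h>

/* O(n^2) rank sort: each element's final position is computed independently */
void dumb_sort(const double *a, double *out, size_t n)
{
  for (size_t i = 0; i < n; i++) {     /* each i could be done in parallel */
    size_t rank = 0;
    for (size_t j = 0; j < n; j++)
      if (a[j] < a[i] || (a[j] == a[i] && j < i)) rank++;
    out[rank] = a[i];
  }
}

Since the n outer iterations are independent, Tpar(p,n) = O(n²/p) as in the example above, yet the algorithm is not work-optimal compared to O(n log n) sorting.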

Lesson: It usually does not make sense to parallelize an inferior algorithm (although it is sometimes much easier). The best known/best possible sequential algorithm is often difficult to parallelize:
- no redundant work (that could have been done in parallel)
- tight dependencies (that force things to be done one after another)

Lesson from much hard work in theory and practice: Parallel solution of a given problem often requires a new algorithmic idea!

But: Many algorithms often have a lot of potential for easy parallelization (loops, independent functions, …), so why not?

Also: a non-work-optimal algorithm can sometimes be useful, as a subroutine.

Parallel performance in practice

Speedup is an empirical quantity, “measured time”, based on experiment (benchmark)

Tseq(n): Running time for “reasonable”, good, best available, sequential implementation, on “reasonable” inputs

Tpar(p,n): parallel running time measured for a number of experiments with different typical (worst-case?) inputs

Sp(n) = Tseq(n)/Tpar(p,n)

Speed-up is typically not independent of problem size n, and problem instance

Parallelization most often incurs overheads:
•Algorithmic: the parallel algorithm may do more work
•Coordination: communication and synchronization
•…

Note: Tpar(1,n) ≥ Tseq(n).

The ratio Tpar(1,n)/Tpar(p,n) is often termed relative speedup, and may express how well the parallel algorithm utilizes the p processors.

Relative speedup NOT to be confused with absolute speedup. Absolute speedup expresses how much can be gained over the best (known/possible) sequential implementation by parallelization. Absolute speed-up is what eventually matters

Note: Literature is not always clear about this distinction. It is easier to achieve and document good relative speedup. Reporting speed-up relative to an inferior, sequential implementation is incorrect!

Absolute vs. relative speedup and scalability

Good scalability and relative speedup Tpar(1,n)/Tpar(p,n) = Θ(p)

Example: 0.1p ≤ Tpar(1,n)/Tpar(p,n) ≤ 0.5p

…is sometimes reported.

But what if Tpar(1,n) = 100Tseq(n)? Or Tseq(n) = O(n) but Tpar(p,n) = O(n log n/p + log n)?

Even when Tpar(1,n) = 100Tseq(n) = O(Tseq(n)), it would take at least 200 processors to be as fast as the sequential algorithm

WS16/17 ©Jesper Larsson Träff Empirical, relative speedup without absolute performance baseline (and comparison to reasonable, sequential algorithm and implementation) is misleading

David H. Bailey, "Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers", Supercomputing Review, Aug. 1991, pp. 54-55

If the algorithm is cost-optimal, p*Tpar(p,n) = k*Tseq(n), the speedup becomes imperfect, but still linear: Sp(n) = p/k

NB: This denotes cumulated time (a “profile”) over the whole execution, not a trace. Computation, overhead, and idle time are spread over the whole execution.

Typical overhead: communication and coordination.

The (smallest) time between coordination periods is called the granularity of the parallel computation.

Granularity:

•“Coarse grained” parallel computation/algorithm: the time/number of instructions between coordination intervals (synchronization operations, communication operations) is large (relative to total time or work)
•“Fine grained” parallel computation/algorithm: the time/number of instructions between… is small

Definition: The difference between max Ti(n) and min Ti(n) is the load imbalance. Achieving Ti(n) ≈ Tj(n) for all processors i,j is called load balancing.

Best parallelization has no load imbalance (and no overhead), so Tpar(p,n) = Tseq(n)/p.

This is the best possible parallel time: Tpar(p,n) close to Tseq(n)/p.

Tseq(n) = (s+r)·Tseq(n): sequential fraction s, parallelizable fraction r.

Maximum speedup becomes severely limited, e.g., if overheads are sequential and a constant fraction:

Tpar(p,n) ≥ s*Tseq(n)+r*Tseq(n)/p

Amdahl's Law (parallel version): Let a program A contain a fraction r that can be “perfectly” parallelized, and a fraction s = (1-r) that is “purely sequential”, i.e., cannot be parallelized at all. The maximum achievable speedup is 1/s, independently of n.

Proof:

Tseq(n) = (s+r)*Tseq(n) Tpar(p,n) = s*Tseq(n) + r*Tseq(n)/p

Sp(n) = Tseq(n)/(s*Tseq(n)+r*Tseq(n)/p) = 1/(s+r/p) -> 1/s, for p -> ∞

G. Amdahl: Validity of the single processor approach to achieving large scale computing capabilities. AFIPS 1967

Typical victims of Amdahl‘s law:

•Sequential input/output could be a constant fraction •Sequential initialization of global data structures •Sequential processing of „hard-to-parallelize“ parts of algorithm, e.g., shared data structures

Amdahl‘s law limits speed-up in such cases, if they are a constant fraction of total time, independent of problem size

Example:

1. Processor 0: read input, some precomputation
2. Split the problem into n/p parts, send part i to processor i
3. All processors i: solve part i
4. All processors i: send partial solution back to processor 0

(Work: 10n in total, of which 9n is parallelizable.)

Amdahl: s=0.1, SU at most 10

Typical Amdahl, sequential bottleneck: Constant sequential fraction (2 out of 4 steps, limits speedup)

When interested in parallel aspects, input-output and problem splitting is often explicitly not measured (see projects)

Example: K iterations before convergence; the (parallel) convergence check is cheap; f(i) is fast, O(1)…

// Sequential initialization
x = (int*)calloc(n, sizeof(int));
…
// Parallelizable part
do {
  for (i=0; i<n; i++) x[i] = f(i);
} while (!converged);

Tseq(n) = n+K+Kn (the calloc zero-initialization of the n elements is an O(n) sequential part)

Sp(n) -> 1+K: Amdahl's law limits the speedup

With malloc instead of calloc:

// Sequential initialization
x = (int*)malloc(n*sizeof(int));
…
// Parallelizable part as above

Tseq(n) = 1+K+Kn

Sp(n) -> p when n>p, n->∞: Amdahl's law does not limit the speedup

Note the very different costs of the seemingly similar initialization functions (calloc, malloc).
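For concreteness, a hedged OpenMP sketch of how the parallelizable part of this example could look; f and converged are placeholders for the slide's unspecified per-element function and convergence test, and the loop body is an assumption:

#include <stdlib.h>

extern int f(int i);                        /* assumed O(1) per call, as on the slide */
extern int converged(const int *x, int n);  /* assumed (cheap) convergence check */

/* compile with e.g. -fopenmp */
void example(int n)
{
  int *x = (int*)calloc(n, sizeof(int));    /* sequential O(n) initialization */
  do {
    #pragma omp parallel for                /* the Kn work is spread over the threads */
    for (int i = 0; i < n; i++)
      x[i] = f(i);
  } while (!converged(x, n));
  free(x);
}

With calloc the O(n) zero-initialization stays sequential and caps the speedup near 1+K; with malloc (when zeroing is not needed) this Amdahl term disappears.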

Avoiding Amdahl: Scaled speedup

Sequential, strictly non-parallelizable part is often not a constant fraction of the total execution time (number of instructions)

The sequential part may be constant, or grow only slowly with problem size n. Thus, to maintain good speedup, problem size should be increased with p

Assume Tseq(n) = t(n)+T(n) with sequential part t(n) and perfectly parallelizable part T(n).

Assume t(n)/T(n) -> 0 for n-> ∞

Tpar(p,n) = t(n)+T(n)/p

Speed-up as a function of p and n:

Sp(n) = (t(n)+T(n))/(t(n)+T(n)/p) = (t(n)/T(n)+1)/(t(n)/T(n)+1/p) -> 1/(1/p) = p for n -> ∞

Definition: Speedup as function of p and n, with sequential and parallelizable times t(n) and T(n) is called scaled speed-up

Lesson: Depending on how fast t(n)/T(n) converges, linear speed-up can be achieved by increasing problem size n

Definition: The efficiency of a parallel algorithm is the ratio of the best possible parallel time to the actual parallel time for given p and n:

E(p,n) = (Tseq(n)/p)/Tpar(p,n) = Sp(n)/p = Tseq(n)/(p*Tpar(p,n)) (the denominator p*Tpar(p,n) is the cost)

Remarks:

•E(p,n) ≤ 1, since Sp(n) = Tseq(n)/Tpar(p,n) ≤ p
•E(p,n) = constant: linear speedup
•Cost-optimal algorithms have constant efficiency

Scalability

Definition: A parallel algorithm/implementation is strongly scaling if

Sp(n) = Θ(p) (linear, independent of (sufficiently large) n)

Definition: A parallel algorithm/implementation is weakly scaling if there is a slow-growing function f(p), such that for n = Ω(f(p)), E(p,n) remains constant. The function f is called the iso-efficiency function

„Maintain efficiency by increasing problem size as f(p) or more“

J. Gustafson: Reevaluating Amdahl's Law. CACM, 1988
Ananth Grama, Anshul Gupta, Vipin Kumar: Isoefficiency: measuring the scalability of parallel algorithms and architectures. IEEE Transactions Par. Dist. Computing, 1(3): 12-21 (1993)

Example:

Assume the convergence check takes O(log p) time.

// Sequential initialization
x = (int*)malloc(n*sizeof(int));
…
// Parallelizable part
do {
  for (i=0; i<n; i++) x[i] = f(i);
} while (!converged);

Tpar(p,n) = Kn/p + K log p

Weakly scalable, n has to increase as O(p log p) to maintain constant efficiency, O(log p) per processor (if balanced)

Examples (Tpar, Speedup, Optimality, Efficiency):

Linear time computation, Tseq(n) = n (constants ignored)

Typical, good, work-optimal parallelizations

1. Tpar0(p,n) = n/p + 1: embarrassingly “data parallel” computation, constant overhead
2. Tpar1(p,n) = n/p + log p: logarithmic overhead, e.g. convergence check
3. Tpar2(p,n) = n/p + log² p
4. Tpar3(p,n) = n/p + √p
5. Tpar4(p,n) = n/p + p: linear overhead, e.g. data exchange
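A small sketch (added here, not from the slides) that tabulates Sp(n) = Tseq(n)/Tpar(p,n) = n/(n/p + o(p)) for the five overhead terms above (taking log to be log₂), e.g. to reproduce curves like the plots that follow:

#include <math.h>
#include <stdio.h>

int main(void)
{
  const double n = 128.0;                       /* problem size, as in the first plot */
  for (int p = 1; p <= 128; p *= 2) {
    double lg = log2((double)p);
    double o[5] = { 1.0, lg, lg*lg, sqrt((double)p), (double)p };
    printf("p=%4d", p);
    for (int k = 0; k < 5; k++)
      printf("  S%d=%6.2f", k, n / (n/p + o[k]));  /* Sp(n) = n / (n/p + overhead) */
    printf("\n");
  }
  return 0;
}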

WS16/17 ©Jesper Larsson Träff n=128

Cost-optimality:

1. Tpar0(p,n) = n/p+1: p*Tpar(p,n) = n+p = O(n) for p=O(n)

2. Tpar1(p,n) = n/p+log p: p*Tpar(p,n) = n+p log p = O(n) for p log p = O(n)

3. Tpar2(p,n) = n/p+log² p: p*Tpar(p,n) = n+p log² p = O(n) for p log² p = O(n)

4. Tpar3(p,n) = n/p+√p: p*Tpar(p,n) = n+p√p = O(n) for p√p=O(n)

5. Tpar4(p,n) = n/p+p: p*Tpar(p,n) = n+p² = O(n) for p² = O(n)

[Plots: n=128; n=128, but p up to 256; n=16384 (=16K=128²); n=2097152 (=2M=128³); n=128; n=16384; n=2097152]

To maintain constant efficiency e=Tseq(n)/(p*Tpar(p,n)), n has to increase as

1. Tpar0(p,n) = n/p+1: f0(p) = e/(1-e)*p

2. Tpar1(p,n) = n/p+log p: f1(p) = e/(1-e)*(p log p)

3. Tpar2(p,n) = n/p+log² p: f2(p) = e/(1-e)*(p log² p)

4. Tpar3(p,n) = n/p+√p: f3(p) = e/(1-e)*(p√p)

5. Tpar4(p,n) = n/p+p: f4(p) = e/(1-e)*p²
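These expressions all come from the same one-line calculation; for case 2, with Tseq(n) = n and cost p·Tpar1(p,n) = n + p log p:

\[
e \;=\; \frac{T_{\mathrm{seq}}(n)}{p\,T_{\mathrm{par1}}(p,n)} \;=\; \frac{n}{n + p\log p}
\quad\Longleftrightarrow\quad
n \;=\; \frac{e}{1-e}\,p\log p \;=\; f_1(p),
\]

and analogously with the other per-processor overhead terms in place of log p.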

[Plot: maintained efficiency = 0.9 (90%)]

Matrix-vector multiplication parallelizations

Tseq(n) = n²

Tpar0(p,n) = n²/p + n
Tpar1(p,n) = n²/p + n + log p
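One plausible reading of these terms (an interpretation, not stated on the slides): with a row-block distribution each processor does n²/p of the multiplication work, the +n term accounts for distributing/collecting an n-element vector, and the +log p term for, e.g., tree-structured coordination:

\[
T_{\mathrm{par0}}(p,n) \;=\; \underbrace{\tfrac{n^2}{p}}_{n/p \text{ rows}} \;+\; \underbrace{n}_{\text{vector handling}},
\qquad
T_{\mathrm{par1}}(p,n) \;=\; \tfrac{n^2}{p} + n + \underbrace{\log p}_{\text{coordination}}.
\]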

[Plot: n=100]

[Plot: n=1000]

Some non work-optimal parallel algorithms

1. Tseq(n) = n log n, Tpar(p,n) = n²/p + 1

2. Tseq(n) = n, Tpar(p,n) = (n log n)/p + 1

Amdahl case, linear sequential running time:

TparA(p,n) = 0.9n/p+0.1n

[Plots: n1 = 128, n2 = 16384]

Limitations of the empirical speedup measure

Empirical speedup assumes that Tseq(n) can be measured.

For very large n and p, this may not be the case: a large HPC system has much more (distributed) main memory than any single-processor system

Scalability measured by other means: •Stepwise speedup (1-1000 processors, 1000-100,000 processors) •Other notions of efficiency

Lecture summary, checklist

•Models of parallel computation: Architecture, programming, cost

•Flynn’s taxonomy: MIMD, SIMD; SPMD

•Sequential baseline

•Speedup (in theory and practice)

•Work, Cost optimality

•Amdahl’s law

•Scaled speed-up, strong and weak scaling
